This paper examines issues related to the statistical power of impact estimates for 
experimental evaluations of education programs. We focus on “group-based” experimental 
designs, because many studies of education programs involve random assignment at the group 
level (for example, at the school or classroom level) rather than at the student level. The 
clustering of students within groups (units) generates design effects that considerably reduce the 
precision of the impact estimates, because the outcomes of students within the same schools or 
classrooms tend to be correlated (that is, are not independent of each other). Thus, statistical 
power is a concern for these evaluations. 

Until recently, evaluations of education programs where the student is the unit of analysis 
have often ignored design effects due to clustering; thus, many of these studies overestimated the 
statistical precision of their impact estimates (Hedges 2004). Consequently, there is currently 
much concern among education policymakers about how to interpret impact findings from 
previous evaluations of education programs, and how to properly design future experimental 
studies to have sufficient statistical power to estimate impacts with the desired level of precision. 
This is a pressing issue because of provisions in the Education Sciences Reform Act of 2002 
specifying, when feasible, the use of experimental designs to provide scientifically-based 
evidence of program effectiveness, and substantial taxpayer resources that are currently targeted 
to large-scale experimental evaluations of educational interventions by the Institute of 
Education Sciences (IES) at the U.S. Department of Education (ED). 

There is a large literature on appropriate statistical methods under clustered randomized 
trials. Walsh (1947) showed that if clusters are the unit of random assignment, then conventional 
analyses will lead to an overstatement about the precision of the results, and the problem 
becomes more severe as the heterogeneity across clusters increases. Cochran (1963) and Kish 
(1964) discuss the calculation of design effects under clustered sample designs in terms of the 
intraclass correlation coefficient (ICC), which is the proportion of variance in the outcome that 
lies between clusters. In a seminal article, Cornfield (1978) first drew attention in the public 
health literature to the analytic issues presented by clustered randomized trials. Since that time, 
there have been extensive methodological developments in adjusting variance estimates for 
clustered designs (see, for example, the books by Donner and Klar 2000 and Murray 1998, and 
Raudenbush 1997). Much of this literature has focused on cluster randomized trials of medical 
and public health interventions (such as community intervention trials (Koepsell et al. 1992), 
interventions against infectious diseases (Hayes et al. 2000), and family practice research 
(Campbell 2000)). Despite this literature, however, Varnell et al. (2004) found that only about 15 
percent of the published impact studies that they reviewed in the public health field used 
appropriate methods to account for clustering; Ukoumunne et al. (1999) came to similarly 
pessimistic conclusions based on their review of publications in seven health science journals. 

Less attention has focused specifically on statistical power analyses in the education field. 
Bryk and Raudenbush (1992), Bloom et al. (1999), and Raudenbush et al. (2004) discuss appropriate 
statistical procedures and provide examples, but do not systematically consider statistical power 
issues for specific designs that are typically used to evaluate school interventions and that are 
based on up-to-date parameter assumptions. 

In this paper, we apply the analytic methods found in the literature to examine appropriate 
school sample sizes in random assignment evaluations of education interventions. We provide a 
unified theoretical framework for examining statistical power under various types of commonly 
used experimental designs that are conducted in a school setting, and discuss appropriate 
precision standards. In our discussion, we provide examples from recent large-scale experimental 
evaluations of education programs. We also provide empirical estimates of key parameters (such 
as intraclass correlations and regression R² values) that are required to estimate power levels. 
Using conservative values of these estimates, we conduct a power analysis for each of the 
considered designs. 

Our empirical analysis focuses on achievement test scores of elementary school and 
preschool students in low-performing school districts because of the accountability provisions of the 
No Child Left Behind Act of 2001. The Act mandates the annual testing of all students in grades 
3 to 8 and the development of initiatives to improve the literacy of preschool and K-3 children. 
Thus, there has been an ensuing federal emphasis on testing interventions to improve reading and 
mathematics scores of young students. Furthermore, more information exists to determine 
appropriate precision standards for student test scores than for other student outcomes. Our analysis 
also focuses on designs with a single treatment and control group per site, which is the most 
common design used in education evaluations. Our methods, however, can be easily generalized 
to experimental designs with multiple treatment groups. 

This paper is in five sections. First, we discuss general issues for a statistical power analysis, 
including procedures for assessing appropriate precision levels. Second, we discuss reasons that 
a clustered design reduces the statistical power of impact estimates and provide a simple 
mathematical formulation of the problem. Third, we discuss procedures that can be used to 
reduce design effects. Fourth, we present power calculations for impact estimates under various 
design options and parameter assumptions. Finally, we present our conclusions. 



A. GENERAL ISSUES FOR A STATISTICAL POWER ANALYSIS 

An important part of any evaluation design is the statistical power analysis, which 
demonstrates how well the design of the study will be able to distinguish real impacts from 
chance differences. Precision levels for most evaluations of education interventions are a 
particularly important issue, because it is often the case that schools or classrooms are randomly 
assigned to a research condition rather than students, which generates design effects from the 
clustering of students within groups. 

In order to determine appropriate sample sizes for experimental evaluations, researchers 
typically calculate minimum detectable impacts, which represent the smallest program impacts — 
average treatment and control group differences — that can be detected with a high probability. In 
addition, it is common to standardize minimum detectable impacts into effect size units — that is, 
as a percentage of the standard deviation of the outcome measures. Researchers often scale 
nominal impact estimates into standard deviation units to facilitate the comparison of findings 
across outcomes that are measured on different scales. Hereafter, we denote minimum detectable 
impacts in effect size units as “MDEs.” 

This paper focuses on the calculation of MDEs. Next, we discuss the structure of MDEs and 
appropriate precision standards for standardized effect sizes. 

1. Structure of MDEs 



MDEs represent the smallest program effects that can be detected with a high degree of 
confidence. MDEs are a function of the standard errors of the impact estimates, the assumed 
significance level (the Type I error rate), the assumed power level (one minus the Type II error 
rate), and the number of degrees of freedom for conducting tests gauging the statistical 
significance of the program impacts. Mathematically, the MDE formula can be expressed as follows: 



(1) $\mathrm{MDE} = \mathrm{Factor}(\alpha, \beta, df) \cdot \sqrt{\mathrm{Var}(impact)} \,/\, \sigma$, 



where Var(impact) is the variance of the impact estimate, σ is the standard deviation of the 
outcome measure, and Factor(.) is a constant that is a function of the significance level (α), 
statistical power (β), and the number of degrees of freedom (df).¹ Factor(.) becomes larger as 
the significance level is decreased and as the power level is increased. Appendix Table A.1 
displays values for Factor(.) by the number of degrees of freedom, for one-tailed and two-tailed 
tests, at 80 and 85 percent power and a 5 percent significance level (which are typical 
assumptions that are used in MDE calculations). 
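
As an illustration of equation (1), the following Python sketch computes Factor(.) and the resulting MDE under the simplifying assumption that the number of degrees of freedom is large enough for the normal distribution to stand in for the t distribution. The function names and the example inputs (an impact variance of 4.0 and an outcome standard deviation of 20 scale points) are hypothetical and are not taken from any evaluation discussed in this paper.

```python
import math
from statistics import NormalDist


def factor_large_df(alpha=0.05, power=0.80, two_tailed=True):
    """Approximate Factor(.) with normal quantiles (assumes a large df)."""
    z = NormalDist().inv_cdf
    tail_alpha = alpha / 2 if two_tailed else alpha
    # Critical value for the significance test plus the quantile for power.
    return z(1 - tail_alpha) + z(power)


def mde(var_impact, sigma, alpha=0.05, power=0.80, two_tailed=True):
    """Equation (1): Factor(.) * sqrt(Var(impact)) / sigma."""
    factor = factor_large_df(alpha, power, two_tailed)
    return factor * math.sqrt(var_impact) / sigma


if __name__ == "__main__":
    # Hypothetical inputs: Var(impact) = 4.0, sigma = 20 scale points.
    print(round(factor_large_df(), 2))
    print(round(mde(var_impact=4.0, sigma=20.0), 2))
```

Under these assumptions, Factor(.) is about 2.8 for a two-tailed test at 80 percent power and a 5 percent significance level, and the hypothetical inputs yield an MDE of about .28 standard deviations. The normal approximation understates Factor(.) when the number of degrees of freedom is small.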

We note that equation (1) ignores the estimation error in the standard deviation (that is, it 
assumes that σ is known). Hedges (2004) uses a more sophisticated ratio estimator that accounts 
for the estimation error in the standard deviation. His resulting variance formulas are very 
similar to those for the case where σ is assumed to be known, except that they include an additive 
correction factor that reflects the estimation error in σ. This correction factor, however, is very 
small in most practical applications and also depends on the true (but unknown) effect size. Thus, 
for simplicity, we do not account for it in our presentation. 

Before discussing issues pertaining to Var(impact), we first discuss several issues pertaining 
to Factor(.) that affect our power calculations, including the use of one-tailed or two-tailed tests, 
accounting for multiple comparisons, and the number of degrees of freedom. 



a. Using a One-Tailed or Two-Tailed Test 

For a given significance level and power level, the use of one-tailed tests produces smaller 
MDEs than the use of two-tailed tests (see Appendix Table A.1).² This is because under a 
one-tailed test, the rejection region for the null hypothesis of no program impact is concentrated 
in only one tail of the distribution of the test statistic, whereas the rejection region under a 
two-tailed test is split between the lower and upper tails of the distribution. 

1 Specifically, Factor(.) can be expressed as [T⁻¹(α) + T⁻¹(β)] for a one-tailed test and [T⁻¹(α/2) + T⁻¹(β)] for a 
two-tailed test, where T⁻¹(.) is the inverse of the Student’s t distribution function with df degrees of freedom (see 
Murray 1998 and Bloom 2004 for derivations of these formulas). 

2 The value of Factor(.) is the same for a two-tailed test at an α significance level and for a one-tailed test at an 
α/2 significance level. 
For several reasons, however, our illustrative power calculations presented in this paper 
focus on two-tailed tests rather than one-tailed tests. First, it is often unclear a priori whether a 
particular intervention will improve all student outcomes. Second, a two-tailed test provides 
more conservative estimates to help guard against unexpected events that might reduce the size 
of the analysis samples. Third, researchers typically employ two-tailed tests when conducting 
statistical tests in impact analyses (even if one-tailed tests were used in the initial power 
calculations). 

We note that power calculations in program evaluations are sometimes conducted using one- 
tailed tests. The use of one-tailed tests is often justified on the grounds that an intervention 
should be supported only if it produces beneficial impacts, so that harmful impacts have the same 
policy significance as zero impacts. 



b. Adjusting Significance Levels for Multiple Comparisons 

MDE calculations are typically performed assuming a 5 percent significance level. 
However, this Type I error can be viewed as being too large when experiments test the relative 
effectiveness of more than one intervention by randomly assigning multiple treatments to units 
(such as schools or classrooms). This is because with multiple comparisons, the chance of 
finding any statistically significant impact, even when none actually exists, is much higher than 5 
percent. For example, suppose four different interventions and a control condition were 
randomly assigned to schools. In this example, there are 5(5 - 1)/2 = 10 pairs of treatment and 
control group means to compare, each with a 5 percent probability of a Type I error. In this case, 
if several t-tests are performed, the probability that at least one of these tests is significant is 
much greater than 5 percent. For example, assuming independent t-tests, the probability that 
at least one of these 10 tests is significant is 40 percent [1 - (1 - .05)^10]. Although this estimate is 
an upper bound (because it assumes independent tests), it demonstrates that there is a good 
chance that the evaluation will conclude that a particular intervention is superior, when in fact, 
all interventions are indistinguishable from each other and from the control condition. This 
erroneous finding could have important policy ramifications. 

To correct for this multiple comparisons problem, the α level could be set lower than 5 
percent when calculating MDEs. A lower α level, however, increases Factor(.), and hence, 
increases MDEs and the required sample sizes for the evaluation. One widely used method is to 
use the Bonferroni inequality and to set the α level at 5 percent divided by the number of tests 
that are conducted. This approach is conservative because it does not account for correlations 
among the tests, but it ensures that the probability of erroneously finding any significant impacts 
across the multiple tests will be no more than 5 percent. Less conservative methods have been 
developed to adjust for correlations among the tests (see, for example, Ramsey 2002). 
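
The arithmetic in the four-intervention example, and the corresponding Bonferroni adjustment, can be reproduced with the short Python sketch below. The function names are illustrative only, and the familywise error calculation assumes independent tests, as in the text.

```python
from math import comb


def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs of research groups that can be compared."""
    return comb(n_groups, 2)


def familywise_error(alpha, n_tests):
    """Probability of at least one Type I error, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni inequality."""
    return alpha / n_tests


if __name__ == "__main__":
    # Four interventions plus one control condition, as in the text.
    n_tests = n_pairwise_comparisons(5)
    print(n_tests)
    print(round(familywise_error(0.05, n_tests), 3))
    adjusted = bonferroni_alpha(0.05, n_tests)
    print(adjusted, round(familywise_error(adjusted, n_tests), 3))
```

With 10 comparisons, the Bonferroni-adjusted significance level is .005 per test, which brings the familywise error rate under independence back below 5 percent.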

Similar correction procedures could also be used in education evaluations that examine 
impacts on multiple outcome measures. The corrections could be made when examining 
outcomes within a similar domain or for priority outcomes. An alternative procedure is to use 
factor or cluster analytic techniques to construct a small number of composite outcome measures 
to help reduce the multiple comparisons problem. 







Finally, a related issue concerns the estimation of impacts for subgroups defined by baseline 
student characteristics (such as gender, race/ethnicity, family income, baseline test scores, etc.), 
that are often calculated in experimental evaluations of education programs. Whether to adjust 
probability levels for multiple comparisons for these subgroup analyses depends on the research 
question. If the research question is, “Does the intervention work for a subgroup in isolation,” 
then α-level corrections are not needed. On the other hand, if the research question is, “Does the 
intervention work better for one subgroup than another,” and if the program intends to use the 
subgroup results to target services to selected students only, then it is appropriate to make the 
multiple comparison corrections. 



c. Number of Degrees of Freedom 

As shown in Appendix Table A.1, Factor(.) is essentially constant if the number of degrees 
of freedom is relatively large (for a given α and β). However, Factor(.) becomes larger if the 
number of degrees of freedom is small. For instance, for a two-tailed test at 80 percent power 
and a 5 percent significance level, Factor(.) is about 3.1 for 10 degrees of freedom, 2.9 for 20 
degrees of freedom, and 2.8 for 100 degrees of freedom, but is 3.7 for 4 degrees of freedom.³ 
Factor(.) is about 7 percent larger for tests at 85 percent power than at 80 percent power. 
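
The Factor(.) values quoted above can be approximated with the following Python sketch, which uses only the standard library. It is a numerical illustration rather than the procedure used to build Appendix Table A.1: the t distribution function is integrated with Simpson's rule and inverted by bisection, and the expression in footnote 1 is read as the sum of the upper-tail critical value for the significance test and the quantile corresponding to the power level. All function names are our own.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF at x >= 0, integrating the density over [0, x] by Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for 0.5 < p < 1 by bisection (search is capped at 50)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def factor(df, alpha=0.05, power=0.80, two_tailed=True):
    """Factor(.) as the sum of the critical value and the power quantile."""
    tail_alpha = alpha / 2 if two_tailed else alpha
    return t_quantile(1 - tail_alpha, df) + t_quantile(power, df)


if __name__ == "__main__":
    for df in (4, 10, 20, 100):
        print(df, round(factor(df), 1), round(factor(df, two_tailed=False), 1))
```

The bisection search is capped at a quantile of 50, which is adequate for the significance and power levels considered here but would need to be widened for very small significance levels combined with very few degrees of freedom.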

In a nonclustered experimental design, where students within a given population are 
randomly assigned directly to a research group, the number of degrees of freedom, $df_{NC}$, can be 
expressed as follows: 

(2) $df_{NC} = \text{Total Number of Students} - \text{Number of Strata} - 1$. 

Thus, under this design, Factor(.) is effectively constant if the sample contains at least 25 or 30 
sample members, which is usually the case. 

Under a group-based design with a single treatment and control group, the number of 
degrees of freedom, $df_C$, is typically expressed as (Murray 1998): 

(3) $df_C = \text{Total Number of Groups} - \text{Number of Strata} - 1$. 

Thus, under a clustered design, Factor(.) does not vary if the number of groups is relatively large 
and if the number of strata is relatively small. The situation, however, is different if only a small 
number of groups (schools or classrooms) are randomly assigned to a research condition. In this 
case, Factor(.) becomes larger (that is, precision levels are reduced). 



3 The corresponding figures for a one-tailed test are somewhat smaller: 2.7 for 10 degrees of freedom, 2.6 for 
20 degrees of freedom, 2.5 for 100 degrees of freedom, and 3.1 for 4 degrees of freedom. 
For example, in the Social and Character Development (SACD) Research Program 
(Schochet et al. 2004), about 10 elementary schools per site were randomly assigned to either a 
treatment group (which will offer a promising SACD intervention designed to improve positive 
social and character development) or to a control group (which will offer the current curriculum), 
with equal numbers of schools assigned to each research group. Furthermore, pairwise matching 
was used to select the treatment and control group schools (that is, five strata of school pairs 
were formed, and one school within each pair was randomly assigned to the treatment group and 
the other to the control group). Thus, for the SACD evaluation, the number of degrees of 
freedom at the site level is 4 (10 schools minus 5 strata minus 1). Consequently, Factor(.) is 
about 3.7 rather than the typical 2.8 value, which has important power implications. 
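
The SACD calculation can be written out as a brief Python sketch. The function name is illustrative, and the two Factor(.) values are simply those quoted in the text for a two-tailed test at 80 percent power and a 5 percent significance level.

```python
def df_clustered(n_groups, n_strata):
    """Degrees of freedom under a group-based design: groups - strata - 1."""
    return n_groups - n_strata - 1


# SACD example from the text: 10 schools per site, matched into 5 pairs.
sacd_df = df_clustered(n_groups=10, n_strata=5)

# Factor(.) values quoted in the text (4 df versus a large number of df).
FACTOR_DF_4 = 3.7
FACTOR_LARGE_DF = 2.8

# For a fixed standard error, the MDE is proportional to Factor(.).
mde_inflation = FACTOR_DF_4 / FACTOR_LARGE_DF

if __name__ == "__main__":
    print(sacd_df)
    print(round(mde_inflation, 2))
```

Holding the standard error of the impact estimate fixed, the site-level MDE is thus roughly one-third larger with 4 degrees of freedom than it would be with many degrees of freedom.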



2. Precision Standards 

A key issue for any evaluation is the precision standard to adopt for the impact estimates. 
There are two key factors that need to be considered in selecting a precision standard for a 
particular study. First, it depends on what impact is deemed meaningful in terms of future, 
longer-term student outcomes (such as high school graduation, college attendance, earnings, 
welfare receipt, criminal behavior, etc.). Second, the precision standard should depend on what 
intervention effects are realistically attainable. 

These two factors will depend on the key study outcome measures and the study context. 
For example, in a medical trial where death is the key outcome, small impacts are clearly 
meaningful, whereas larger standardized effect sizes might be appropriate in education trials. 
Similarly, in terms of attainability, some student outcomes are harder to influence than others. 
For instance, it might be more difficult for an intervention to improve test scores than student 
attitudes, so smaller effect size targets are more appropriate for studies focusing on test scores. 

There is no uniform basis for adopting precision standards in educational research, and this 
critical issue has not been rigorously addressed in the literature, primarily because it is often 
difficult to determine what size impacts are “meaningful,” especially for young children. In this 
section, we discuss several procedures that can be used in practice. 



a. Examine Impact Results from Previous Evaluations 

One approach for adopting a precision standard is to use impact results found in previous 
evaluations similar to the one under investigation. For instance, to evaluate the impacts of a 
reading intervention on elementary school children, one could adopt a precision standard based 
on impact results from previous evaluations of similar reading interventions that were tested on a 
similar student population. This approach is appropriate if the previous impact studies produced 
credible results based on rigorous evaluation designs, and if the studies found beneficial and 
meaningful program impact estimates. 

Another widely-used approach is to use meta-analysis results from previous impact studies 
across a broad range of disciplines to examine the magnitude of impacts that have been achieved. 
Cohen (1988) suggested that effect sizes of .20 are small, effect sizes of .50 are moderate, and 
effect sizes of .80 are large. In an important study, Lipsey and Wilson (1993) examined the 
distribution of effect size estimates reported in 9,400 studies (with more than 1 million individual 
subjects) testing the efficacy of various psychological, educational, and behavioral interventions. 
They found that one-third of the effect sizes were smaller than .32, one-third were between .33 
and .55, and one-third were between .56 and 1.20. 

Based on these studies, many evaluations of education programs adopt standardized effect 
sizes of .20, .25, or .33 as the precision standard. While this meta-analysis approach can be used 
to determine what impacts could be attainable for a particular intervention, it does not 
necessarily address what impacts are meaningful. As discussed next, we believe that these 
precision standards are somewhat high for testing the efficacy of education interventions on 
student test scores. 



b. Adopt a Benefit-Cost Framework 

One approach for assessing meaningful standardized effect sizes, which suggests that 
smaller benchmark precision standards are appropriate, is to select samples large enough to 
detect impacts such that program benefits would offset program costs. This approach could be 
used in studies where a dollar value can be assigned to key program benefits. For instance, 
several studies have indicated that a one standard deviation increase in either math or reading test 
scores for elementary school children is associated with about 8 percent higher earnings when 
the students join the labor market (Currie and Thomas (1999); Murnane, Willet, and Levy 
(1995); Neal and Johnson (1996)). Krueger (2000) estimates that the present discounted value of 
this higher earnings stream over a worker’s lifetime due to a one standard deviation increase in 
test scores is about $37,500.⁴,⁵ Consequently, the present value of the lifetime earnings gain would be 
$12,375 if the intervention improved test scores by .33 of a standard deviation, $7,500 for an 
impact of .20 standard deviations, and $3,750 for an impact of .10 standard deviations. Stated 
differently, if an intervention improved test scores by .20 standard deviations, then Krueger’s 
estimates suggest that program benefits would exceed program costs if the intervention cost less 
than $7,500 per pupil — which was roughly the nationwide total expenditures per pupil in 1997- 
98. Because most interventions are likely to cost less than $7,500 per pupil, these results suggest 
that a precision standard of .20 standard deviations might be too large from a benefit-cost 
standpoint. Stated differently, the evaluation could miss an effect worth finding if the precision 
standard was .20. 
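
The benefit-cost arithmetic above can be sketched in Python as follows. The $37,500 figure is Krueger's (2000) estimate as cited in the text, and the calculation assumes, as the text does, that the earnings gain scales linearly with the effect size. The function names and the $2,000 per-pupil cost are hypothetical and are used only for illustration.

```python
# Present value of lifetime earnings gain per one standard deviation
# increase in test scores (Krueger 2000, as cited in the text).
PV_EARNINGS_PER_SD = 37_500


def pv_earnings_gain(effect_size, pv_per_sd=PV_EARNINGS_PER_SD):
    """Present value of the earnings gain, assuming linear scaling."""
    return effect_size * pv_per_sd


def break_even_effect_size(cost_per_pupil, pv_per_sd=PV_EARNINGS_PER_SD):
    """Smallest effect size at which benefits offset per-pupil costs."""
    return cost_per_pupil / pv_per_sd


if __name__ == "__main__":
    for es in (0.33, 0.20, 0.10):
        print(es, round(pv_earnings_gain(es)))
    # Hypothetical intervention costing $2,000 per pupil.
    print(round(break_even_effect_size(2_000), 3))
```

Under these assumptions, a hypothetical intervention costing $2,000 per pupil would break even at an effect size of roughly .05 standard deviations, well below the .20 benchmark.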

These results, however, must be interpreted cautiously, because there is only a small 
literature on the long-term economic returns to test score increases for elementary school 
children. Furthermore, many of the studies cited above pertain to older students only, and it is 
likely that the test score-earnings relationship is stronger for older students than younger ones. 
For instance, Murnane, Willet, and Levy (1995) used data from the High School and Beyond 
survey to estimate the economic returns to test score increases using male high school seniors. 



4 This figure was calculated using (1) the age-earnings profile in the March 1999 Current Population Survey, 
(2) a 4 percent discount rate, (3) the assumption that workers begin work at age 18 and retire at age 65, and 
(4) a productivity (wage) growth rate of 1 percent per year. 

5 Kane and Staiger (2002) find similar results. 







Similarly, Neal and Johnson (1996) used the National Longitudinal Survey of Youth to estimate 
the effect of students’ AFQT scores at age 15 to 18 on the students’ earnings at age 26 to 29. 
Furthermore, although Currie and Thomas (1999) examined the relationship between test scores 
at age 7 and earnings at age 33, they used data from the British National Child Development 
Study. Thus, their results may not pertain to students in the United States. 

Consequently, although these studies suggest that relatively small test score gains for 
elementary school children are associated with relatively large lifetime earnings gains, these 
results must be deemed tenuous. Much more research is needed to examine the test score- 
earnings relationship using data collected on samples of pre-school and elementary school 
children as they enter adulthood and beyond. 



c. Examine the Natural Progression of Students 

Another approach is to adopt a precision standard based on the natural growth of student 
outcomes over time to get a sense of intervention effects that can be realistically attained and that 
are meaningful. This approach, however, can only be used for those outcome measures that can 
be compared over time and that naturally change over time. 

Several studies suggest that the test performance of elementary school students in math and 
reading grows by about .70 standard deviations per grade. Kane (2004) compared Stanford 9 
achievement reading and math test scores across elementary school grades (where the scores 
were “scaled” to allow comparisons of scores across grades). He found that test performance 
grew by approximately .70 standard deviations in math and .80 standard deviations in reading 
per grade level. However, the rise in test scores was smaller after third grade; between fifth and 
sixth grades, performance grew by only .30 standard deviations in both math and reading. We 
found similar results using scaled SAT-9 test score data from the Longitudinal Evaluation of 
School Change and Performance (LESCP) in Title I schools; the average reading and math test 
score gain between the third and fourth grades was about .70 standard deviations.⁶ 

Assuming that test score gains occur uniformly throughout the school year, an average test 
score gain of about .70 standard deviations suggests that a standardized effect size of .20 
corresponds to roughly 3 months of instruction (assuming a regular 10-month school year). This 
is a large impact given all else that is occurring in students’ lives. Thus, according to this metric, 
it might be appropriate to adopt a smaller, more attainable precision standard. For instance, an 
effect size of .10 corresponds to about 1 to 1.5 months of instruction. 
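
The conversion from effect sizes to months of instruction is a simple proportion, sketched below in Python. The default values of .70 standard deviations per grade and a 10-month school year come from the text, and the calculation assumes that gains accrue uniformly over the school year. The function name is our own.

```python
def months_of_instruction(effect_size, annual_gain_sd=0.70, school_year_months=10):
    """Months of typical growth that an effect size represents."""
    return effect_size / annual_gain_sd * school_year_months


if __name__ == "__main__":
    for es in (0.33, 0.20, 0.10):
        print(es, round(months_of_instruction(es), 1))
```

An effect size of .20 corresponds to just under 3 months and an effect size of .10 to about 1.4 months. With the smaller annual gain of .30 standard deviations reported between fifth and sixth grades, the same effect sizes would represent considerably more months of instruction.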



d. Examine the Distribution of Outcomes Across Schools 

Another metric for assessing an appropriate precision standard is to assess what an MDE 
implies about movements in mean student outcomes in a typical school relative to the 



6 Kane’s results are based on separate cross-sections of students (which could be affected by cohort effects), 
whereas the LESCP results are based on the same students over time. 
distribution of outcomes across a broader set of schools. This approach again suggests that effect 
sizes of .20 to .33 are large. 

For instance, we analyzed California Achievement Test (CAT-6) data for third graders using 
data from the 2004 California Standardized Testing and Reporting (STAR) Program. Consider a 
school at the 25th percentile of the math or reading test score distribution. A 33 percent effect 
size implies that the intervention would move that school from the 25th to 37th percentile of the 
score distribution, which is a large increase.⁷ Similarly, a 20 percent effect size would move that 
school to the 33rd percentile. A more attainable 10 percent effect size would move the school 
from the 25th to 29th percentile. LESCP data for SAT-9 scores of third graders in Title I schools 
yield similar findings. 

A related method is to assess the magnitude of MDEs by translating them into nominal 
impacts for binary outcomes (such as the percentage of students with test scores below a certain 
threshold level). For example, according to the National Assessment of Educational Progress 
(NAEP), nearly 70 percent of fourth graders nationally performed below the Proficient level in 
reading and math in 2003 (NCES 2004). For this binary outcome, effect sizes of .33, .25, and 
.20 translate into impacts of about 15.0, 11.5, and 9.2 percentage points, respectively. Stated 
differently, an effect size of .33 implies that the intervention must reduce the percentage of 
students scoring below the Proficient level from 70 to 55 percent, which is a large reduction. A 
smaller effect size of .10 translates into an impact of about 4.6 percentage points. 
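
The translation of effect sizes into percentage-point impacts for a binary outcome can be sketched as follows. The sketch assumes that the standard deviation of a binary outcome with mean p is the square root of p(1 - p), which for p = .70 is about .46. The function name is illustrative.

```python
import math


def binary_impact_pct_points(effect_size, base_rate):
    """Percentage-point impact implied by an effect size for a 0/1 outcome."""
    sd = math.sqrt(base_rate * (1 - base_rate))
    return 100 * effect_size * sd


if __name__ == "__main__":
    # About 70 percent of fourth graders scored below Proficient (NAEP 2003).
    for es in (0.33, 0.25, 0.20, 0.10):
        print(es, round(binary_impact_pct_points(es, 0.70), 1))
```

This calculation reproduces the figures in the text to within rounding; for an effect size of .33 it gives about 15.1 percentage points, compared with the value of about 15.0 reported above.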

In sum, there is no standard basis for assessing appropriate precision standards for 
experimental impact evaluations of education programs. A precision standard of between .20 
and .33 of a standard deviation is often used, and is justified on the basis of meta-analysis results 
across a range of fields. This approach also represents a reasonable compromise between 
evaluation rigor and evaluation cost. However, it must be viewed as somewhat ad hoc. Other 
methods suggest that smaller effect sizes are meaningful for examining intervention effects on 
test scores. 

Finally, our discussion has focused on precision standards for comparing one treatment 
group to one control group. It is more difficult to develop rules for adopting a precision standard 
for comparing across treatment groups in experiments with multiple treatments, because this will 
depend on the nature of the interventions being tested. However, we can expect impacts to be 
smaller when treatments are compared to each other than when a treatment is compared to the 
control condition. Thus, MDEs should be set lower for power analyses that focus on between- 
treatment contrasts. 



B. VARIANCE CALCULATIONS FOR GROUP-BASED EXPERIMENTAL DESIGNS 

As discussed, MDEs are proportional to the standard errors of the impact estimates. Under a 
group-based, clustered design, standard errors are typically larger than those under a 
nonclustered design of the same size, and thus, clustering usually increases MDEs. The 



7 The 25th percentile of the school distribution is 605 for reading and 602 for math, and the standard deviation 
of CAT/6 scores is about 20 scale points. 
clustering of students within schools increases standard errors, because students who live in the 
same communities and face the same school environments tend to be similar. The clustering of 
students within classrooms also increases standard errors, because of teacher effects and the 
possibility that schools group similar types of students in the same classrooms. Thus, the 
precision of estimates under a clustered design is reduced, because the variance expressions must 
account not only for the variance of outcomes across students, but also for the variance of 
average student outcomes across schools and across classrooms within schools. 

In this section, we present a simple, unified mathematical formulation to demonstrate the 
sources of variance under various types of designs that are typically used in impact evaluations 
of education programs. The designs are, in general, ordered from least to most clustered. To 
make the presentation concrete, we consider experimental designs where the following units are 
randomly assigned to a research status: 



• Students within sites (schools or districts) 

• Classrooms within schools 

• Schools within districts 



Clustering in these designs comes from two potential sources: (1) the random assignment of units 
to the treatment and control groups, and (2) the random sampling of units from a broader 
universe of units before or after random assignment takes place. As part of our presentation, we 
discuss the important issue of when it is appropriate in the variance calculations to treat group 
effects as random or fixed, which has important implications for the statistical power of the 
designs. 

For ease of exposition, we first demonstrate the variance formulas assuming that the 
evaluation sample is selected from a single participating site — such as a school or school district. 
We then indicate how to generalize the variance formulas when aggregating the sample across 
multiple sites to obtain pooled estimates. As discussed, we assume designs where a single 
treatment is tested against the control condition within each site. 

Table 1 summarizes the various designs that we consider, and displays equation numbers in 
the text for the variance formulas for each design. These designs can all be estimated using 
standard statistical packages (see, for example, Murray (1998) and Singer (1998)). 



1. Random Assignment of Students Within Sites: Fixed-Effects Case 

In some designs, students in purposively-selected schools or districts are randomly assigned 
directly to the treatment and control groups without regard to the classrooms or schools that the 
students attend. This design was used in the evaluation of the 21st Century Community Learning 
Centers Program (Dynarski et al. 2004), where interested students within each of the study 
schools were randomly assigned to either a treatment group (who could attend an after-school 
program) or a control group (who could not). Another example of this design is the Impact 
Evaluation of Charter School Strategies (Gleason et al. 2004) where, within each charter school 
area, students interested in attending a charter school will be randomly assigned to either a 
treatment group (who will be allowed to enroll in a charter school) or a control group (who will 
not). 

TABLE 1 

SUMMARY OF ALTERNATIVE DESIGNS 

| Design Designation and Unit of Random Assignment | Fixed or Random Site, School, and Classroom Effects | Sources of Clustering | Equation Numbers for Variance Formulas |
| --- | --- | --- | --- |
| I: Students Within Sites (Schools or Districts) | Fixed Site Effects | None | 6 |
| II: Students Within Sites | Random Site Effects | Sites | 8, 10 |
| III: Students Within Sites | Random Site and Subunit Effects | Sites; Subunits ᵇ | 13 |
| IV: Classrooms Within Schools | Fixed or No School Effects ᵃ | Classrooms | 14 |
| V: Classrooms Within Schools (At Least 2 Classrooms per School) | Random School Effects | Schools, Classrooms | 15 |
| VI: Classrooms Within Schools (Only 2 Classrooms per School) | Random or Fixed School Effects | Schools | 15 with $\rho_2 = 0$ |
| VII: Schools Within Districts | Fixed Classroom Effects | Schools | 16 |
| VIII: Schools Within Districts | Random Classroom Effects | Schools, Classrooms | 17 |

Note: All designs assume fixed school district effects, except for Designs I and II where sites are school 
districts. 

ᵃ This design is pertinent if (1) there are at least two treatment and control classrooms per school and school 
fixed effects are included in the analysis, or (2) if there is only 1 classroom per condition per school, but 
school fixed effects are not included in the analysis. 

ᵇ Subunits are classrooms when sites are schools, and schools when sites are school districts. 



Clearly, these designs are not appropriate for testing classroom-based interventions where 
random assignment at the classroom or teacher level is required. Furthermore, these designs are 
appropriate only if potential “spillover” effects are expected to be small, so that students in the 
control group are expected to “receive” little of the intervention through their contact with 
students in the treatment group (that is, there is no “diffusion of treatments,” as denoted by Cook 
and Campbell 1979). 

Under these types of designs, an important issue is whether the variance calculations should 
account for school- or classroom-level clustering. There are two views on this issue. First, if the 
impact findings are to be generalized only to the specific classrooms and schools included in the 
study (the fixed effects case), then clustering is not present, even though sample members are 
grouped in the same classrooms and schools. This is because students in the treatment and 
control groups are expected to be spread across all classrooms and schools in the sample. 
Bypassing the selection of classrooms or schools removes the link between students and 
classrooms/schools, so direct inferences can be made about intervention effects that 
pertain only to students in the study samples. 

The other view is that the impact findings can be generalized to a broader population (or 
“superpopulation”) of classrooms and schools “similar” to the ones included in the study. In this 
view, students and teachers change over the short term, and the ones that are observed at a fixed 
time point are a representative sample from this larger population. In this case, the variance 
estimates should account for classroom- or school-level clustering. 

In this section, we consider designs without school- or classroom-level clustering, which 
we label nonclustered, stratified designs. First, we discuss the appropriate variance calculations 
for impact estimates within sites (strata), and then for impact estimates pooled across sites. 



a. Variance of Impact Estimates Within Sites 

Under a nonclustered, stratified design, the variance of an impact estimate within a site 
(that is, the variance of the difference between the mean outcomes of the treatment and control 
groups) must account for between-student variance only, and can be expressed as follows: 

(4)  $\mathrm{Var}(\text{impact in site } p) = \dfrac{2\sigma_p^2}{m_p}$

where $m_p$ is the size of the treatment (control) group in site $p$ and $\sigma_p^2$ is the variance of the 
outcome measure.⁸,⁹ 



b. Variance of Pooled Impact Estimates 

Pooled impact estimates are often obtained in impact evaluations conducted in multiple sites 
in order to examine the extent to which, taken together, the tested interventions change student 
outcomes relative to what they would have been otherwise. In many instances, estimating pooled 
impacts is appropriate, because even in cases where the tested interventions differ somewhat 
across sites and serve different populations, the interventions are usually within the same general 
category (such as a reading or math curriculum, an after-school program, a technology, a charter 
or magnet school, a teacher preparation model, or a social and character development initiative), 
and often share common features and a common funding source. Thus, it is typically of policy 
interest to examine the overall efficacy of promising interventions within a general class of 
treatments, even though the results must be interpreted carefully, and site-specific impacts must 
be examined separately to assess whether the pooled impacts are driven by a small number of 
sites. 

A central issue in the variance calculations for the pooled estimates is whether site effects 
should be treated as fixed or random. For most evaluations of education programs, sites (such as 
schools or school districts) are purposively selected for the study for a variety of reasons (such as 
the site’s willingness to participate, whether the site has a sufficient number of potential program 
participants to accommodate a control group, and so on). In these instances, the variance 
calculations hinge critically on whether the pooled estimates are viewed as generalizing to the 
study sites only (the fixed effects case) or to a broader population of sites similar to the study 
sites (the random effects case). In the fixed effects case, between-site variance terms do not 
enter the variance calculations (because in repeated “sampling,” the same, fixed, set of sites 
would always be “selected”), unlike the random effects case where the study sites constitute a 
random sample, or at least a representative sample, from some larger population. 

Although this issue needs to be addressed for each study, we believe that the fixed effects 
case is usually more realistic in evaluations of education interventions. Most evaluations are 
efficacy trials where a relatively small number of purposively-selected sites are included in the 
study. Thus, in many instances, it is untenable to assume that the study sites are representative of 
a broader, well-defined population. Furthermore, inflating the standard errors to incorporate 
between-site effects will slant the study in favor of finding internally valid impact estimates that 
are not statistically significant, thereby providing less information to policymakers on potentially 
promising interventions. Instead, we believe, in general, that it is preferable to treat site effects 
as fixed, and to assess the “generalizability” of study findings by examining the pattern of the 
impact estimates across sites (for example, by calculating the percentage of sites with beneficial 
impacts). This approach is likely to yield credible information on the extent to which specific 
interventions could be effective, and whether larger-scale studies are warranted to examine 
whether they are effective. 

8 We assume equal numbers of treatment and control group students for illustrative simplicity, and because for 
a given total research sample size, a 50:50 split between the treatment and control groups yields the most precise 
estimates. We follow this approach for the remainder of this section, although we discuss unbalanced designs later 
in this paper. The formulas for unbalanced designs are very similar to the ones presented in this paper. 

9 We discuss the use of the finite sample correction (equal to 1 minus the proportion of the population being 
selected) in the variance calculations later in this paper. 

Pooled impact estimates in the fixed-effects case are calculated as a weighted average of the 
impact estimates in each site. The associated variances are obtained by aggregating the site- 
specific variances in equation (4) as follows: 



(5)  $\mathrm{Var}(\text{pooled impact}) = \sum_{p} w_p^2 \, \dfrac{2\sigma_p^2}{m_p}$

where $w_p$ is the weight associated with site $p$ and where the weights sum to unity. Each site 
could be given equal weight in the analysis, or weights could be constructed to be inversely 
proportional to site-specific variances (Fleiss 1986). 

To reduce notation, and to facilitate comparisons with the other designs discussed below, we 
will refer to the following simplified version of equation (5): 



(6)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2}{sm}$

where $\sigma^2$ is the average variance across the $s$ sites, and $m$ is the average number of treatment or 
control group members per site. 
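As a rough numerical sketch of equations (4) through (6), the Python snippet below computes site-specific variances of the form 2σ_p²/m_p, aggregates them with weights that sum to one, and compares equal weighting with inverse-variance weighting. The function names and all input values (an outcome variance of 400 and the site sizes) are illustrative assumptions, not figures from the evaluations cited in this paper.

```python
def site_variance(sigma2_p, m_p):
    """Equation (4): variance of the impact estimate within one site."""
    return 2.0 * sigma2_p / m_p


def inverse_variance_weights(site_variances):
    """Weights proportional to 1 / variance, rescaled to sum to one."""
    inv = [1.0 / v for v in site_variances]
    total = sum(inv)
    return [x / total for x in inv]


def pooled_variance(site_variances, weights=None):
    """Equation (5): variance of a weighted average of independent site impacts."""
    if weights is None:  # default: every site gets equal weight
        weights = [1.0 / len(site_variances)] * len(site_variances)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * w * v for w, v in zip(weights, site_variances))


def pooled_variance_simplified(sigma2, s, m):
    """Equation (6): 2 * sigma^2 / (s * m)."""
    return 2.0 * sigma2 / (s * m)


# Hypothetical inputs: four identical sites, 50 students per research group.
equal_sites = [site_variance(400.0, 50) for _ in range(4)]
v_equal = pooled_variance(equal_sites)
v_simple = pooled_variance_simplified(400.0, s=4, m=50)

# Two sites of unequal size: 50 and 200 students per research group.
unequal_sites = [site_variance(400.0, 50), site_variance(400.0, 200)]
v_unweighted = pooled_variance(unequal_sites)
v_precision = pooled_variance(unequal_sites, inverse_variance_weights(unequal_sites))

print(v_equal, v_simple, v_unweighted, v_precision)
```

With identical sites and equal weights, equation (5) collapses to equation (6). With unequal sites, inverse-variance weights give a smaller pooled variance than equal weights in this example.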

To further demonstrate the appropriate calculations in the fixed-effects case, the following 
regression (ANOVA) model can be used for estimating pooled impacts across sites (schools or 
districts): 



(7)  $Y_{ip} = \lambda_0 + \sum_{p=2}^{s} \delta_p D_{ip} + \sum_{p=1}^{s} \lambda_{1p}(D_{ip} T_{ip}) + e_{ip}$

where $Y_{ip}$ is a continuous, posttreatment outcome measure for student $i$ nested in site $p$ (nested in 
the treatment or control condition), $D_{ip}$ is an indicator variable equal to 1 for those in site $p$, $T_{ip}$ is 
an indicator variable equal to 1 for treatment group members, and $e_{ip}$ are assumed to be iid 
$N(0, \sigma^2)$ student-level random error terms. In this model, site-specific impacts are treated as fixed 
(not random) and are represented by the $\lambda_{1p}$ parameters. Pooled impact estimates are then 
calculated as a simple (or weighted) average of the site-specific impact estimates, and similarly 
for the estimated variances. Thus, in this model, the variance estimates are not inflated to 
account for between-site effects. 

We note that the fixed versus random effects issue is more complex in instances where a 
large number of purposively-selected sites are included in the study. For example, in the Impact 
Evaluation of Charter School Strategies (Gleason et al. 2004), a large number of geographically- 
dispersed school districts will be included in the evaluation. If available data indicate that the 
characteristics of these school districts are similar to the larger population of school districts with 
charter schools, then it might be reasonable to include between-district effects in the variance 
calculations. However, even in these instances, we believe that this approach should be 
supplementary to the primary approach discussed above, and should be used only to check the 
robustness of study findings. 

Finally, for several reasons, it might be appropriate to account for between-site variance 
terms when conducting the power analysis during the design phase of an evaluation. First, this 
approach provides a guide to the number of sites that should be selected for the study. This is 
important, because for a given sample size, the fixed-effects approach generates the same 
precision levels for a design with many sites and only a few students per site, and a design with a 
smaller number of sites but with more students per site. Second, incorporating between-site 
variance terms is conservative, because it will generate larger sample sizes to help guard against 
unexpected events that could reduce the size of the analysis sample during the follow-up period. 



2. Random Assignment of Students Within Sites: Random-Effects Case 

In this section, we consider designs where students are randomly assigned to a research 
condition within sites, and where site effects are treated as random. This random-effects case can 
occur in two ways. First, as discussed, purposively-selected sites could be considered 
representative of a broader population of similar sites. Second, in some evaluations, sites are 
randomly sampled from a larger pool of sites. This type of design is typically employed in large- 
scale studies of a well-established program or intervention that require externally-valid impact 
estimates (and where the burden of evidence of program effectiveness is set high). For example, 
for the national evaluation of Upward Bound (Myers et al. 1999), a nationally representative 
sample of eligible program applicants was selected in two stages. In the first stage, a random 
sample of Upward Bound sites (projects) was selected from all sites nationwide, and in the 
second stage, students within each of the selected sites were randomly assigned to either a 
treatment or control group. For this evaluation, the impact results are generalizable to all 
Upward Bound projects nationwide. Similarly, for the National Job Corps Study (Schochet et al. 
2001), all eligible Job Corps applicants nationwide in 1995 were randomly assigned to a research 
condition. 

In these random-effects designs, study results can be generalized more broadly than in the 
fixed-effects designs. However, this generalization involves a cost in terms of precision levels: 
the variance formulas must be inflated to account for between-site effects. Stated differently, site 
effects must be treated in the variance formulas as random, not fixed. Intuitively, in repeated 
sampling, a different set of sites would be selected for the evaluation, which could influence the 
impact findings. Hence, the variance expressions must account for the extent to which mean 
student outcomes vary across sites. 

To illustrate the variance calculations under a random-effects design, we first consider the 
scenario where (1) site (school or district) effects are random, (2) students within sites are 
randomly assigned to a research condition, and (3) there is no clustering of students within 
subunits (classrooms or schools). In this case, the variance formula for a pooled impact estimate 
can be expressed as follows: 

(8)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{\sigma_\tau^2}{s} + \dfrac{2\sigma_e^2}{sm}$

where $s$ is the number of sites in the sample, $m$ is the average number of treatment or control 
group members in each site, $\sigma_e^2$ is the variance of the outcome measure for students within 
sites, and $\sigma_\tau^2$ is the variance of the impacts (treatment effects) across sites. 

The within-site variance term in equation (8) is the conventional variance expression for an 
impact estimate under a nonclustered design (see equation (6)). Design effects in a clustered 
design arise because of the first variance term (that is, the between-site term), and can be large 
because the divisor in this term is the number of sites rather than the number of sample members. 
Thus, precision levels can usually be improved by selecting more sites (for example, schools) 
and fewer students per site (to the extent that project resources allow). The optimal allocation of 
sites and students can be obtained by minimizing equation (8) subject to a budget constraint that 
includes unit costs of including an additional site and an additional student (Raudenbush 1997). 
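The trade-off between sites and students per site can be sketched numerically. The snippet below evaluates equation (8), the between-site impact variance divided by s plus twice the within-site variance divided by sm, and runs a simple grid search for the allocation with the lowest variance under a linear budget constraint. The grid search, the cost structure, and all parameter values are hypothetical stand-ins for illustration. They are not the closed-form solution in Raudenbush (1997).

```python
def var_random_sites(sigma2_tau, sigma2_e, s, m):
    """Equation (8): between-site impact variance plus within-site variance."""
    return sigma2_tau / s + 2.0 * sigma2_e / (s * m)


def best_allocation(sigma2_tau, sigma2_e, budget, cost_site, cost_student, max_m=500):
    """Search over m (students per research group per site) for the lowest variance.

    For each m, use as many sites as the budget allows. Returns (variance, s, m),
    or None if the budget cannot cover at least two sites.
    """
    best = None
    for m in range(1, max_m + 1):
        cost_per_site = cost_site + 2 * m * cost_student  # m treatment + m control
        s = int(budget // cost_per_site)
        if s < 2:
            break  # a larger m only makes each site more expensive
        v = var_random_sites(sigma2_tau, sigma2_e, s, m)
        if best is None or v < best[0]:
            best = (v, s, m)
    return best


# Same total sample (1,000 per research group), split two ways (hypothetical values).
few_large_sites = var_random_sites(sigma2_tau=10.0, sigma2_e=400.0, s=10, m=100)
many_small_sites = var_random_sites(sigma2_tau=10.0, sigma2_e=400.0, s=50, m=20)

plan = best_allocation(10.0, 400.0, budget=100_000, cost_site=2_000, cost_student=50)
print(few_large_sites, many_small_sites, plan)
```

Holding the total sample fixed, the design with more sites has the smaller variance, because only the between-site term depends on s alone. Once sites carry a cost, the search balances that gain against the students that each added site displaces.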

To make equation (8) more operational for our power calculations (and for purposes of 
comparing variance formulas across other designs), we use the following expression for $\sigma_\tau^2$: 

(9)  $\sigma_\tau^2 = 2\sigma_u^2(1 - c_1)$,

where $\sigma_u^2$ is the variance of the mean outcome measure (not impacts) across sites (which is 
assumed to be equal for the treatment and control groups), and $c_1$ is the correlation between the 
treatment and control group means within a site.¹⁰ This correlation is likely to be positive 
because students in the same site (for example, school) are likely to have similar characteristics, 
have similar teachers, and face similar environments. 



10 This expression can be derived by noting that the variance of the impact across sites is the sum of (1) the 
variance of the mean outcome for the treatment group across sites; (2) the variance of the mean outcome for the 
control group across sites (which is assumed to be roughly the same as that for the treatment group in step (1)); and 
(3) -2 times the covariance of the treatment and control group means within a site. 
If we insert equation (9) into equation (8), and define $\rho_1$ as the between-site variance in the 
outcome measure ($\sigma_u^2$) as a proportion of the total variance of the outcome measure ($\sigma^2$), then 
the variance formula can be expressed as follows: 

(10)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2\rho_1(1-c_1)}{s} + \dfrac{2\sigma^2(1-\rho_1)}{sm}$

where $\sigma^2 = \sigma_u^2 + \sigma_e^2$. The term, $\rho_1$, is the intraclass correlation (ICC), which tends to be large if 
mean student outcomes vary considerably across sites, and tends to be small if site means are 
similar. 

In this formulation, the design effect from clustering is small (that is, near 1) if either the 
mean of the outcome measure does not vary across sites (that is, if $\rho_1$ is small), or if the 
correlation between the treatment and control group means within a site is large and positive 
(that is, if $c_1$ is near 1). A large correlation implies that impacts do not vary across sites. 
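To make the role of the ICC and the within-site correlation concrete, the sketch below evaluates equation (10) and a design effect, defined here as the ratio of equation (10) to its within-site term alone. That definition is an assumption made for illustration, chosen because it equals 1 exactly when the ICC is zero or the correlation is 1. The function names and parameter values are likewise hypothetical.

```python
def var_eq10(sigma2, rho1, c1, s, m):
    """Equation (10): between-site term plus within-site term."""
    between = 2.0 * sigma2 * rho1 * (1.0 - c1) / s
    within = 2.0 * sigma2 * (1.0 - rho1) / (s * m)
    return between + within


def design_effect(rho1, c1, m):
    """Equation (10) divided by its within-site term (an assumed definition)."""
    return 1.0 + m * rho1 * (1.0 - c1) / (1.0 - rho1)


# Hypothetical design: 20 sites, 30 students per research group per site.
for rho1 in (0.0, 0.05, 0.15):
    for c1 in (0.0, 0.5, 1.0):
        v = var_eq10(sigma2=400.0, rho1=rho1, c1=c1, s=20, m=30)
        print(f"rho1={rho1:.2f} c1={c1:.1f} var={v:.3f} deff={design_effect(rho1, c1, 30):.2f}")
```

The printed grid shows the design effect staying at 1 when either condition in the paragraph above holds, and growing with the number of students per site otherwise.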

The variance expressions in equations (8) and (10) can be derived using the following two- 
level hierarchical linear model (HLM) (Bryk and Raudenbush 1992): 

(11)  Level 1: $Y_{ip} = \lambda_{0p} + \lambda_{1p} T_{ip} + e_{ip}$
      Level 2: $\lambda_{0p} = \lambda_0 + u_p$
               $\lambda_{1p} = \lambda_1 + \tau_p$

where Level 1 corresponds to students and Level 2 corresponds to sites (schools or districts). 
The term, $Y_{ip}$, is the continuous outcome measure for student $i$ in site $p$; $T_{ip}$ is an indicator 
variable equal to 1 for treatment group members and 0 for controls; $u_p$ are assumed to be iid 
$N(0, \sigma_u^2)$ site-specific random error terms; $\tau_p$ are iid $N(0, \sigma_\tau^2)$ error terms which represent the 
extent to which treatment effects vary across sites; $e_{ip}$ are iid $N(0, \sigma_e^2)$ within-site error terms that 
are distributed independently of $u_p$ and $\tau_p$; and the $\lambda$ terms are parameters. 

Inserting the Level 2 equations into the Level 1 equation yields the following unified 
regression model: 



(12)  $Y_{ip} = \lambda_0 + \lambda_1 T_{ip} + [u_p + \tau_p T_{ip} + e_{ip}]$.

In this formulation, $\lambda_1$ represents the pooled impact estimate (that is, $[\bar{Y}_{..T} - \bar{Y}_{..C}]$, where $\bar{Y}_{..T}$ and 
$\bar{Y}_{..C}$ represent mean outcomes for the treatment and control groups, respectively), and its 
associated variance is $(\sigma_\tau^2/s + 2\sigma_e^2/ms)$, which is identical to equation (8). In this model, the 
random school and treatment effects ($u_p$ and $\tau_p$) are a component of the error structure and 
account for the clustering of students within sites. This is very different from the fixed-effects 
specification (see equation (7)) where site effects are not included in the error structure, but are 
treated as fixed parameters in the regression model. 
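One way to check the link between the unified random-effects model and equation (8) is a small Monte Carlo exercise. The sketch below repeatedly draws data from a model with a random site effect, a random site-specific treatment effect, and a student-level error, then compares the empirical variance of the pooled impact estimate with the between-site impact variance over s plus twice the within-site variance over sm. All parameter values, the seed, and the number of replications are arbitrary assumptions chosen for illustration.

```python
import random
import statistics


def simulate_pooled_impact(rng, s, m, lam0, lam1, sd_u, sd_tau, sd_e):
    """Draw one sample from the unified model and return the pooled impact estimate."""
    site_impacts = []
    for _ in range(s):
        u_p = rng.gauss(0.0, sd_u)      # random site effect
        tau_p = rng.gauss(0.0, sd_tau)  # random site-specific treatment effect
        treat = [lam0 + lam1 + u_p + tau_p + rng.gauss(0.0, sd_e) for _ in range(m)]
        ctrl = [lam0 + u_p + rng.gauss(0.0, sd_e) for _ in range(m)]
        site_impacts.append(statistics.fmean(treat) - statistics.fmean(ctrl))
    # With equal group sizes in every site, the mean of the site differences
    # equals the overall treatment mean minus the overall control mean.
    return statistics.fmean(site_impacts)


def theoretical_variance(s, m, sd_tau, sd_e):
    """Equation (8) written in terms of standard deviations."""
    return sd_tau ** 2 / s + 2.0 * sd_e ** 2 / (s * m)


rng = random.Random(2005)  # fixed seed so the sketch is reproducible
params = dict(s=8, m=10, lam0=600.0, lam1=5.0, sd_u=5.0, sd_tau=3.0, sd_e=10.0)
estimates = [simulate_pooled_impact(rng, **params) for _ in range(2000)]

empirical = statistics.variance(estimates)
theory = theoretical_variance(params["s"], params["m"], params["sd_tau"], params["sd_e"])
print(f"empirical {empirical:.3f} vs. theoretical {theory:.3f}")
```

The site effect is common to both research groups within a site, so it cancels out of each site's treatment-control difference. This is why its variance does not appear directly in equation (8).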

The variance formulas presented above can be easily generalized to account also for 
additional levels of clustering within sites. For instance, if classrooms in study schools were 
considered to be representative of a broader population of classrooms in these schools, then this 
design could be represented as a three-level HLM model, where Level 1 corresponds to students 
(the level of random assignment), Level 2 to classrooms, and Level 3 to schools. This design 
effectively treats students as if they were randomly assigned to the treatment and control groups 
within classrooms. This framework yields the following variance formula: 



(13)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2\rho_1(1-c_1)}{s} + \dfrac{2\sigma^2\rho_2(1-c_2)}{sk} + \dfrac{2\sigma^2(1-\rho_1-\rho_2)}{sk(.5n)}$

where $\rho_2$ is the between-classroom effect, $c_2$ is the correlation between the outcomes of treatment 
and control group students within classrooms, $k$ is the average number of classrooms per school, 
$n$ is the average number of students per classroom (split evenly between the treatment and 
control groups), and where other parameters are defined as above. This expression accounts not 
only for the extent to which treatment effects vary across schools, but also the extent to which 
treatment effects vary across classrooms within schools. In this random-effects framework, 
additional variance terms could be included to account for potential “treatment-induced” 
correlations between the outcomes of treatment group members if the intervention is 
administered in small groups, thereby creating potential correlations between the outcomes of 
treatment group members within each small group (Murray et al. 2004; Raudenbush 1997). 
These correlations could be modeled as another level in the HLM framework. Finally, additional 
variance terms at the level of the school district could also be included in the variance formulas if 
district effects were treated as random. 
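The three-level formula in equation (13) can be sketched as a function with one term per level: a school term divided by s, a classroom term divided by sk, and a student term divided by sk(.5n). The snippet also checks that it collapses to the two-level formula in equation (10) when the between-classroom share is zero. The function names and parameter values are hypothetical.

```python
def var_eq10(sigma2, rho1, c1, s, m):
    """Equation (10): two-level formula (sites and students)."""
    return 2.0 * sigma2 * rho1 * (1.0 - c1) / s + 2.0 * sigma2 * (1.0 - rho1) / (s * m)


def var_eq13(sigma2, rho1, rho2, c1, c2, s, k, n):
    """Equation (13): three-level formula (schools, classrooms, students)."""
    school_term = 2.0 * sigma2 * rho1 * (1.0 - c1) / s
    classroom_term = 2.0 * sigma2 * rho2 * (1.0 - c2) / (s * k)
    student_term = 2.0 * sigma2 * (1.0 - rho1 - rho2) / (s * k * 0.5 * n)
    return school_term + classroom_term + student_term


# Hypothetical design: 20 schools, 3 classrooms per school, 20 students per classroom,
# so each school has k * n / 2 = 30 students per research group.
two_level = var_eq10(sigma2=400.0, rho1=0.15, c1=0.5, s=20, m=30)
no_classroom_effect = var_eq13(400.0, rho1=0.15, rho2=0.0, c1=0.5, c2=0.0, s=20, k=3, n=20)
with_classroom_effect = var_eq13(400.0, rho1=0.15, rho2=0.05, c1=0.5, c2=0.5, s=20, k=3, n=20)

print(two_level, no_classroom_effect, with_classroom_effect)
```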

Importantly, simulation studies (Murray et al. 1996) suggest that Type I error rates for 
statistical tests of intervention effects are similar whether the variance expressions account for 
clustering only at the highest level of clustering or also for clustering of 
intermediate nested subunits. These findings suggest that empirical results based on equations 
(10) and (13) could be similar (as long as covariates at the classroom level are not included in the 
regression models). 

Finally, the above analysis suggests that in some evaluations, researchers face a conundrum 
about whether or not to randomly select study sites. For example, suppose an evaluation is being 
conducted in a small purposively-selected city. Furthermore, suppose that all schools in that city 
agree to participate in the study. In this case, one can argue that the study schools should be 
selected randomly, so that the impact results can be generalized to all schools in that city. On the 
other hand, for a fixed sample size, selecting schools randomly rather than purposively will 
inflate the standard errors of the impact estimates, which will reduce the chance that the study 
will find statistically significant impact estimates. Furthermore, one might argue that in an 
efficacy study, it might not make much difference from a policy standpoint whether the results 
can be generalized to all schools in the small city or to only those schools that are selected for the 
study. Clearly, the choice of whether to select sites randomly or not will depend on the scope 
and objectives of the study. However, in making this design decision, it is important 
to consider the tradeoff between statistical power and the generalizability of study findings. 



3. Random Assignment of Classrooms Within Schools 

A design commonly used in evaluations of school interventions is one in which classrooms 
or teachers within study schools are randomly assigned to the treatment or control groups. For 
example, in the Evaluation of the Effectiveness of Educational Technology Interventions 
(Dynarski et al. 2004), teachers in participating schools will be assigned at random to use a 
technology intervention or not. This type of design is appropriate for interventions that are 
administered at the classroom level and where potential spillover effects are deemed to be small. 

One way to interpret this design is that a “mini-experiment” is being conducted within each 
school. Under a design with purposively-selected schools and where school effects are treated as 
fixed, pooled impact estimates across schools are calculated as a simple or weighted average of 
the impact estimates from each mini-experiment. Accordingly, the variance formula for these 
pooled impact estimates can be expressed as follows: 



(14)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2\rho_2}{s(.5k)} + \dfrac{2\sigma^2(1-\rho_2)}{s(.5k)n}$

where $s$ is the total number of schools in the sample, $k$ is the number of classrooms per school 
(split evenly between the treatment and control groups), $n$ is the average number of students per 
classroom, and $\rho_2$ is the between-classroom variance as a proportion of the total variance.¹¹ 
Design effects arise because of the between-classroom variance term. 
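The size of the design effect under classroom-level random assignment with fixed school effects can be illustrated with a short calculation. The sketch below assumes the two-term form of equation (14): a between-classroom term divided by the number of classrooms per research group, s(.5k), plus a student term divided by the number of students per research group, s(.5k)n. It then compares the result with a nonclustered design of the same size. The function names and parameter values are hypothetical.

```python
def var_eq14(sigma2, rho2, s, k, n):
    """Equation (14), as assumed above: classroom term plus student term."""
    classrooms_per_group = s * 0.5 * k
    between = 2.0 * sigma2 * rho2 / classrooms_per_group
    within = 2.0 * sigma2 * (1.0 - rho2) / (classrooms_per_group * n)
    return between + within


def var_nonclustered(sigma2, students_per_group):
    """Variance of a difference in means with no clustering."""
    return 2.0 * sigma2 / students_per_group


# Hypothetical design: 10 schools, 4 classrooms per school, 21 students per classroom.
s, k, n, rho2 = 10, 4, 21, 0.10
clustered = var_eq14(400.0, rho2, s, k, n)
simple = var_nonclustered(400.0, s * 0.5 * k * n)
deff = clustered / simple

print(clustered, simple, deff)
```

Under these assumptions the ratio works out to 1 + ρ₂(n − 1), so even a modest between-classroom share inflates the variance noticeably when classrooms are large.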

Several important features of this variance formula are worth mentioning. First, in some 
evaluations, all children in the study classrooms are included in the study. Under the fixed 
effects scenario, one could then argue that student effects should not be included in the variance 
calculations (because there is no sampling of students within classrooms). However, it is 
customary to include these student-level terms, because it is usually the case that some children 
will not provide follow-up data due to study nonconsent, attrition, and interview nonresponse. 
Thus, students in the follow-up sample are often considered to be representative of a larger pool 
of students in the study schools. 

Second, in some evaluations, students within each of the participating schools and grades are 
randomly assigned to classrooms at the start of the school year. For example, for the Teach For 
America (TFA) Evaluation (Decker et al. 2004), students within each of the study schools and 
grades were randomly assigned to classrooms taught by TFA teachers or to classrooms taught by 
other teachers. This design ensures that the average baseline characteristics of students in the 
treatment and control group classrooms are similar. While this design reduces classroom effects, 
it does not remove them. This is because classroom effects arise from two sources: (1) 
differences in the quality of teachers within schools, and (2) systematic differences in the types 
of children who are assigned to different classrooms. The random assignment of children to 
classrooms reduces the second source of variance, but not the first source. 

11 The variance of an impact estimate within a single school can be obtained by setting s equal to 1 in equation 
(14). 

Third, if school effects are treated as random, then both school-level and classroom-level 
clustering are present. Using results from the previous section, the variance can be 
expressed as follows: 

(15)  $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2\rho_1(1-c_3)}{s} + \dfrac{2\sigma^2\rho_2}{s(.5k)} + \dfrac{2\sigma^2(1-\rho_1-\rho_2)}{s(.5k)n}$

where $c_3$ is the correlation between the treatment and control group classroom means within a 
school, and where other parameters are defined as above. For two reasons, this variance is larger 
than the corresponding variance in equation (13) under the design where students are the unit of 
random assignment. First, twice as many treatment and control group classrooms are in the 
sample when students are the unit of random assignment. Second, unlike equation (15), the 
classroom-level term in equation (13) is deflated by the correlation between the outcomes of 
treatment and control group students within the same classrooms. 
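The comparison between equations (13) and (15) can be checked numerically. The sketch below codes both formulas and isolates the gap between them, which comes entirely from the classroom-level term when the school-level correlations are set equal. The function names, the choice to set the two school-level correlations to the same value, and all parameter values are assumptions for illustration.

```python
def var_eq13(sigma2, rho1, rho2, c1, c2, s, k, n):
    """Equation (13): students randomly assigned within classrooms."""
    return (2.0 * sigma2 * rho1 * (1.0 - c1) / s
            + 2.0 * sigma2 * rho2 * (1.0 - c2) / (s * k)
            + 2.0 * sigma2 * (1.0 - rho1 - rho2) / (s * k * 0.5 * n))


def var_eq15(sigma2, rho1, rho2, c3, s, k, n):
    """Equation (15): classrooms randomly assigned, random school effects."""
    return (2.0 * sigma2 * rho1 * (1.0 - c3) / s
            + 2.0 * sigma2 * rho2 / (s * 0.5 * k)
            + 2.0 * sigma2 * (1.0 - rho1 - rho2) / (s * 0.5 * k * n))


# Hypothetical design: 20 schools, 4 classrooms per school, 20 students per classroom.
sigma2, rho1, rho2, s, k, n = 400.0, 0.15, 0.05, 20, 4, 20
c_school, c2 = 0.5, 0.5  # same school-level correlation used in both formulas

student_ra = var_eq13(sigma2, rho1, rho2, c_school, c2, s, k, n)
classroom_ra = var_eq15(sigma2, rho1, rho2, c_school, s, k, n)
gap = classroom_ra - student_ra

print(student_ra, classroom_ra, gap)
```

Because the student-level terms coincide, the gap equals 2σ²ρ₂(1 + c₂)/(sk), reflecting both reasons given above: half as many classrooms per research group, and no deflation by the within-classroom correlation.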

Finally, due to limitations in the number of available classrooms, it is often the case that 
only one treatment and control classroom can be selected per school. In this case, there are not 
enough degrees of freedom to estimate between-classroom effects within schools, which are 
confounded with between-school effects (Murray 1998). One approach is to set $\rho_2$ to zero in 
equation (15) and to use the resulting variance formula for either the random or fixed effects 
specifications. Another possibility for the fixed effects specification is to use equation (14) and 
ignore the stratification by school (that is, by not including fixed school effects in the regression 
models). In this case, the between-classroom effect is estimated by combining classrooms across 
schools, which could increase design effects due to clustering. To mitigate these precision losses, 
another approach is to combine similar schools into larger strata, thereby making it possible to 
estimate between-classroom effects within strata. 



4. Random Assignment of Schools 

In some designs, schools within districts are randomly assigned to the treatment and control 
groups. These designs are necessary for testing interventions that are school-based. For instance, 
the Social and Character Development (SACD) Research Program is testing, in seven sites, 
promising interventions designed to promote positive social and character development and 
prevent negative behaviors among elementary school students. In each site, 10 to 18 schools 
were randomly assigned to either a treatment group (who are offering the SACD intervention) or 
to a control group (who are offering the current curriculum), with equal numbers of schools 
assigned to each research group. This design is necessary to avoid contamination of the control 
group, because the SACD interventions include components aimed at changing schoolwide 
outcomes. Another example is the Evaluation of the Impact of Teacher Induction Programs 
(Johnson et al. 2005) where two models of high-intensity teacher induction are being tested in 20 
high-poverty, large school districts across the country. Within each district, 10 elementary 
schools will be randomly selected to implement the high-intensity program, and 10 will be 
randomly selected to continue to receive whatever induction program their respective districts 
normally provide. 

Next, we discuss the variance formulas for impact estimates under school-based 
experimental designs with and without classroom effects. We focus on designs where school 
districts volunteer for the study (that is, are selected purposively) and where district effects are 
treated as fixed, not random. 



a. Clustering at the School Level Only 

For a school-based experimental evaluation, one design option is not to sample classrooms 
within the treatment and control group schools. For this option, either all relevant classrooms in 
the selected schools are included in the research sample, or students are sampled directly to the 
research sample without regard to the classrooms that they are in. For example, under the SACD 
design, all consenting students in third-grade classrooms were included in the research sample. 

In these designs, if the impact findings are to be generalized only to the study schools and 
classrooms at the time of sampling, there is clustering at the school level, but not at the 
classroom level. Intuitively, if sampling were repeated, a different random allocation of schools 
would be selected to the treatment and control groups, but not a different set of classrooms 
within schools. Consequently, the variance of an impact estimate within a district can be 
expressed as follows: 



(16)  $\mathrm{Var}(\text{impact in a district}) = \dfrac{2\sigma^2 \rho_1}{.5s} + \dfrac{2\sigma^2 (1 - \rho_1)}{(.5s)kn}$



where all parameters are defined as above. 
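
As an illustration, the Python sketch below evaluates the school-level variance expression, read here as 2σ²ρ1/(.5s) + 2σ²(1 − ρ1)/((.5s)kn). The function names and the parameter values are hypothetical, and the interpretation of s, k, and n as schools, classrooms per school, and students per classroom is an assumption based on the earlier definitions.

```python
def var_school_level(s, k, n, rho1, sigma2=1.0):
    """Variance of the impact estimate within a district under equation (16).

    Assumed inputs: s schools split evenly between conditions, k classrooms
    per school, n students per classroom, rho1 = school-level ICC, and
    sigma2 = total variance of the outcome (1.0 for standardized scores).
    """
    schools_per_group = 0.5 * s
    between_schools = 2 * sigma2 * rho1 / schools_per_group
    within_schools = 2 * sigma2 * (1 - rho1) / (schools_per_group * k * n)
    return between_schools + within_schools


def design_effect(s, k, n, rho1):
    """Clustered variance relative to the variance with rho1 = 0."""
    return var_school_level(s, k, n, rho1) / var_school_level(s, k, n, 0.0)


# Illustrative values only: 20 schools, 3 classrooms of 20 students each.
variance = var_school_level(s=20, k=3, n=20, rho1=0.15)
deff = design_effect(s=20, k=3, n=20, rho1=0.15)
print(f"variance = {variance:.4f}, design effect = {deff:.2f}")
```

Under these assumptions the ratio reduces algebraically to 1 + (kn − 1)ρ1, so the design effect grows with the number of students per school as well as with the ICC.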

The variance estimates under this school-level design are larger than those previously 
considered for two main reasons. First, there are now half as many treatment (control) schools 
(because random assignment occurs between schools rather than within them). Second, the 
between-school variance term is no longer deflated by the correlation between the treatment and 
control group means within schools. 

Pooled impact estimates across districts can be calculated as a simple or weighted average of 
the district-specific impact estimates, and similarly for the associated variance estimates. The
treatment of district effects as random would introduce additional design effects, because the 
variance formulas would need to contain district-level variance terms. 



b. Clustering at the School and Classroom Level

For a school-based experimental evaluation, there could also be clustering at the classroom 
level. This would occur if, to conserve project resources, classrooms were sampled within the 
study schools, or if the full set of classrooms in the study schools were considered to be 
representative of a larger population of classrooms in those schools. 

In the presence of both school- and classroom-level clustering, the variance formula can be 
now expressed as follows: 



(17)  $\mathrm{Var}(\text{impact in a district}) = \dfrac{2\sigma^2 \rho_1}{.5s} + \dfrac{2\sigma^2 \rho_2}{(.5s)k^*} + \dfrac{2\sigma^2 (1 - \rho_1 - \rho_2)}{(.5s)k^* n}$

where $k^*$ is the number of sampled classrooms, and all other parameters are defined as above.
Design effects arise in this design from the first and second variance terms, and hence, are larger 
than in the previous design with clustering at the school level only. It is noteworthy that neither
the school- nor the classroom-level terms are deflated by correlations between the outcomes of the
treatment and control groups. Additional variance terms are required if school district effects are 
treated as random. We note again, however, that there is some empirical evidence that in multi- 
stage clustered designs, variance estimates are similar if the variance formulas account for 
clustering at the highest level only or if they account also for clustering at lower levels (Murray 
et al. 1996). 
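
To show how the variance splits across levels when both school- and classroom-level clustering are present, the following Python sketch computes the three terms separately. It assumes the form 2σ²ρ1/(.5s) + 2σ²ρ2/((.5s)k*) + 2σ²(1 − ρ1 − ρ2)/((.5s)k*n); the function name, argument names, and input values are hypothetical.

```python
def var_school_classroom(s, k_star, n, rho1, rho2, sigma2=1.0):
    """Variance terms of the impact estimate with two levels of clustering.

    Assumed inputs: s schools split evenly between conditions, k_star sampled
    classrooms per school, n students per classroom, rho1 and rho2 = school-
    and classroom-level ICCs, sigma2 = total outcome variance.
    """
    schools_per_group = 0.5 * s
    terms = {
        "school": 2 * sigma2 * rho1 / schools_per_group,
        "classroom": 2 * sigma2 * rho2 / (schools_per_group * k_star),
        "student": 2 * sigma2 * (1 - rho1 - rho2)
        / (schools_per_group * k_star * n),
    }
    terms["total"] = sum(terms.values())  # summed before "total" is added
    return terms


# Illustrative values only.
for name, value in var_school_classroom(20, 3, 20, 0.15, 0.15).items():
    print(f"{name:9s} {value:.4f}")
```

With inputs of this kind the school-level term dominates, which is in line with the observation that accounting for clustering at the highest level captures most of the design effect.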



5. Estimating Correlations 

A critical issue for the MDE calculations is what estimates to use for the following 
correlations that enter the variance formulas: 



• $\rho_1$ = The extent to which mean outcomes differ across schools (that is, the ICC at the
  school level)¹²

• $\rho_2$ = The extent to which mean outcomes differ across classrooms within schools (that
  is, the ICC at the classroom level)

• $c_1$ = The correlation between the mean outcomes of treatment and control group
  students within schools

• $c_2$ = The correlation between the mean outcomes of treatment and control group
  students within classrooms

• $c_3$ = The correlation between the mean outcomes of treatment and control group
  classrooms within schools



12 We have also considered designs that require ICCs at the district level (see Design II in Table 1). However,
for this paper, we focus on ICCs at the school level, because this type of design is more common, and there is more
empirical evidence on ICCs across schools than across school districts.



As discussed, for policy reasons and the current research emphasis at ED, we focus our 
presentation on obtaining plausible correlation values for standardized math and reading test 
scores of elementary school and preschool students in low-performing schools. For context, we 
also discuss plausible correlation values for behavioral outcomes. 



a. Intraclass Correlations 

To obtain plausible values for $\rho_1$ for student achievement measures, we examined results
found in the literature, and performed new tabulations using reading and math test score data 
from several recent evaluations conducted by Mathematica Policy Research Inc. (see Table 2). 

We find that ICCs for standardized test scores vary somewhat by data source, and differ 
somewhat by grade level. The ICCs, however, typically become smaller when adjusted for 
district fixed effects, because these figures pertain to ICCs within districts rather than across all 
districts. Hedberg et al. (2004) show also that ICCs vary by region of the country and urban/rural 
status, although the pattern of the estimates across subgroups is not always clear. Consequently, 
the ICCs that are applicable for a specific power analysis will depend on the study context, and, 
in particular, on the homogeneity of the schools in the sample. 

Nonetheless, the examined data sources suggest that values for $\rho_1$ often range from .10 to .20
for standardized test scores. Thus, in our illustrative power calculations below, we use the
midpoint, .15, as a reasonable approximation for $\rho_1$. Because of the uncertainty in this
parameter, however, we also present selected calculations assuming a more optimistic value of
.10 and a less optimistic value of .20.

There is less evidence on plausible $\rho_2$ values because there are fewer data sources that have
student-level data on multiple classrooms within schools (within a treatment condition). LESCP
and TFA data suggest values of about .16 for $\rho_2$. Thus, $\rho_2$ values appear to be similar to $\rho_1$
values. Stated differently, mean student test scores tend to differ as much across classrooms
within schools as they do across schools. This could be because the examined data
sources contain relatively homogeneous schools in low-income districts with low aggregate
test scores. Thus, differences in teacher quality within these schools might have a large effect on
student academic achievement. Our estimates for $\rho_2$, however, are based on only a small number
of data sources, and an important future research topic is to estimate ICCs at the classroom level
using additional data sources. In our power calculations, we assume the same values for $\rho_2$ and
$\rho_1$.



Finally, for context, we examined the much larger literature on ICCs based on behavioral 
outcomes (see Murray et al. 2004 for a review). Siddiqui et al. (1996) present ICCs from a study 
of smoking prevention programs based on 6,695 seventh-graders in 287 classrooms from 47 
schools. Outcomes examined include students’ knowledge of health and tobacco, students’

TABLE 2

INTRACLASS CORRELATION ESTIMATES FOR STANDARDIZED TEST SCORES ACROSS
ELEMENTARY SCHOOLS AND PRESCHOOLS, BY DATA SOURCE

Elementary Schools

| Data Source | Description of Data | Standardized Test Measure | Grade and Year | ICC Estimate |
|---|---|---|---|---|
| Longitudinal Evaluation of School Change and Performance (LESCP) | 71 Title I schools in 18 school districts in 7 states | Stanford 9 | 3rd in 1997; 4th in 1998; 5th in 1999 | Unadjusted: 3rd: Math .13, Reading .13; 4th: Math .24, Reading .19; 5th: Math .18, Reading .21. Adjusted for district effects: 3rd: Math .08, Reading .06; 4th: Math .07, Reading .07; 5th: Math .11, Reading .11 |
| Prospects Study: figures reported in Hedberg et al. (2004) | 372 Title I schools in 120 school districts | Comprehensive Test of Basic Skills (CTBS) | 3rd in 1991 | Unadjusted: Math .23, Reading .20. Adjusted (a): Math .16, Reading .18 |
| National Education Longitudinal Study (NELS): figures reported in Hedberg et al. (2004) | 1,052 schools | NELS:88 Test Battery | 8th in 1988 | Unadjusted: Math .24, Reading .17. Adjusted (a): Math .12, Reading .08 |
| Teach for America Evaluation | 17 schools in six cities (Baltimore, Chicago, Los Angeles, Houston, New Orleans, and the Mississippi Delta) | Iowa Test of Basic Skills (ITBS) | 2nd to 4th in 2003 | Unadjusted: 2nd: Math .10, Reading .23; 3rd: Math .03, Reading .05; 4th: Math .16, Reading .16 |
| 21st Century Community Learning Centers Program | 30 schools in 12 school districts | SAT-9 | 1st, 3rd, and 5th in 2002 | Unadjusted: 1st: Math .17, Reading .19; 3rd: Math .19, Reading .24; 5th: Math .17, Reading .09 |
| Data from Rochester: figures calculated from MDEs reported in Bloom et al. (1999) | 25 elementary schools | Pupil Evaluation Program (PEP) Test | 3rd and 6th in 1992 | Unadjusted: 3rd: Math .19, Reading .18; 6th: Math .19, Reading .14 |
| Data from Louisville: figures reported in Gargani and Cook (2005) | 22 schools | KCCT developed for Kentucky students | Grade not reported; 2003 | Reading .11 |

Preschools

| Data Source | Description of Data | Standardized Test Measure | Grade and Year | ICC Estimate |
|---|---|---|---|---|
| Early Reading First Evaluation | 162 preschools in 68 sites | Expressive One Word Picture Vocabulary (EOW) Test; PLS Auditory | 4-year-olds in 2004 | Unadjusted: PLS .18, EOW .14. Adjusted for district effects: PLS .08, EOW .08 |
| FACES 2000 | 219 centers in 43 Head Start programs | PPVT; Woodcock Johnson Applied Problems (WJMATH); Woodcock Johnson Letter-Word Identification (WJWORD) | 4-year-olds in fall 2000 | Unadjusted: PPVT .38, WJMATH .13, WJWORD .16. Adjusted for district effects: PPVT .11, WJMATH .06, WJWORD .03 |
| Early Head Start Evaluation | Families in 17 Early Head Start programs | Bayley MDI; PPVT | 3-year-olds between 1996 and 1999 | Unadjusted: Bayley .19, PPVT .18 |
| Preschool Curriculum Evaluation (PCER) | 113 preschools across 7 PCER grantees | PPVT | 4-year-olds in 2004 | Unadjusted: PPVT .20 |
| Early Childhood Longitudinal Study: figures reported in Hedberg et al. (2004) | 1,000 public and private kindergartens | ECLS-K | Kindergarteners in spring 1999 | Adjusted (a): Math .17, Reading .23 |

Note: Tabulations were conducted using SAS PROC MIXED.
(a) Adjusted for SES, race, and gender.

knowledge of social influences and resistance skills, and the prevalence of student smoking.
Their analysis suggests a wide range of intraclass correlations, with $\rho_1$ ranging from .01 to .09
and $\rho_2$ ranging from .04 to .14 on the three outcomes listed above. Similarly, Aber et al. (1999)
found in the evaluation of the Resolving Conflict Creatively Program (RCCP) that intraclass
correlations for reports of students’ aggressiveness, pro-social behavior, and hostile attribution 
bias ranged from about .02 to .06. Murray et al. (2003) found similar estimates based on 1,881 
ICCs from 17 studies across a variety of outcome measures (tobacco, drug, and alcohol use; diet 
and nutrition, general health and personal factors). Finally, Ukoumunne et al. (1999) found 
using data from the Health Survey of England that ICCs for lifestyle risk factors were generally 
below .01 at the district health authority level. Thus, ICCs for behavioral outcomes appear to be 
somewhat smaller than those for academic outcomes. 



b. Correlations Between Treatment and Control Group Means Within Schools 

The parameters $c_1$ to $c_3$ measure the correlation between treatment and control group
outcomes under designs where school effects are treated as random. It is more difficult to
estimate values for these correlations than for $\rho_1$ and $\rho_2$, because they depend on the relative
effectiveness of the tested interventions across sites. However, these correlations tend to be
positive and large, because intervention effects do not typically vary substantially across sites.
For instance, in the evaluation of the 21st Century Community Learning Centers Program, the
value of $c_1$ for math and reading test scores was about .85 across elementary schools and .70
across middle schools. Similarly, in the evaluation of the School Dropout Demonstration
Assistance Program, the value of $c_1$ for student grades was .80. Finally, in the Early Head Start
evaluation, $c_1$ was about .80 for Bayley scores and .70 for the MacArthur CDI. However,
because of the uncertainty of this correlation, we assume a conservative value of .50 for $c_1$ in our
power calculations below, and assume the same value for $c_2$.

We have less information on plausible values for $c_3$. ITBS test score data from the TFA
evaluation suggest a value of about .50 for $c_3$. However, we assume a more conservative value
of .30 in our power calculations to reflect the uncertainty in this parameter.



C. WAYS TO IMPROVE PRECISION UNDER A CLUSTERED DESIGN 

As discussed, clustering at the school and classroom levels substantially reduces the 
precision of estimates. There are, however, several design and estimation strategies that can be 
used in clustered designs to reduce design effects. In this section, we discuss these strategies. 



1. Using a Balanced Sample Size Allocation 

For a given total sample size of schools, classrooms, and students, a 50-50 split of the 
treatment and control groups yields more precise estimates than other splits. Bloom (2004) 
demonstrates, however, that precision levels do not erode substantially unless the proportion of 
the total sample that is allocated to the treatment or control groups exceeds 80 percent or is less 
than 20 percent. This is an important finding, because selecting a larger control group could 
reduce study costs associated with implementing the tested interventions. Conversely, a larger
treatment group sample might be preferred, because district and school staff might be more 
willing to participate in a random assignment study if the size of the control group is as small as 
possible. Furthermore, larger treatment groups increase the precision of impact estimates for 
subgroups defined by program experiences and program features. Nonetheless, a balanced 
allocation produces the most precise estimates, and thus, many evaluations adopt this design. 
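
The cost of an unbalanced allocation can be quantified with a short calculation. For a fixed total number of randomized units and equal outcome variances in the two groups, the variance of the impact estimate is proportional to 1/p + 1/(1 − p), where p is the share of units assigned to the treatment group. The Python sketch below (the function name is hypothetical) expresses the resulting standard error relative to a 50-50 split.

```python
import math


def se_inflation(p):
    """Standard error of the impact estimate relative to a 50-50 split.

    p is the share of randomized units assigned to the treatment group.
    Assumes a fixed total sample and equal variances in the two groups.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return 1 / (2 * math.sqrt(p * (1 - p)))


for share in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"treatment share {share:.1f}: SE ratio {se_inflation(share):.3f}")
```

Under these assumptions the ratio is symmetric in p and 1 − p and rises slowly near .5, which is why moderate departures from a balanced design cost relatively little precision.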

Another reason to adopt a balanced sample allocation is that statistical tests under this 
design are robust to deviations from the usual assumption that the variances of the outcome 
measures are the same for the treatment and control groups. Traditional t-tests are strictly valid 
only under the homoscedasticity assumption that treatment and control group variances are the
same. However, if the variances differ (because of intervention effects on the distribution of the 
outcome variables), the literature suggests that t-tests are approximately valid under balanced 
sample allocations, but are not valid under unbalanced sample allocations (Snedecor 1956; Gail 
et al. 1996). 



2. Using Stratified Sampling Methods 

The use of stratified sampling methods to select treatment and control groups can reduce 
design effects. This is because under a stratified design, the ICCs pertain to clustering effects 
within strata (assuming that fixed stratum effects are included in the regression or ANOVA 
models). Thus, to the extent that strata are formed using group-level measures that are correlated 
with the outcome measures, stratified sampling will diminish clustering effects. 

Many of the designs that we have considered in this paper are stratified designs where 
random assignment occurs within fixed strata defined by purposively-selected school districts or 
schools. Additional stratification can further reduce design effects. For instance, under a design 
where classrooms within schools are randomly assigned to a research condition, classroom strata 
could first be formed based on available teacher characteristics, and the treatment and control 
groups would then be selected within each stratum. As another example, under a design where
schools are randomly assigned within districts, schools could first be grouped on the basis of 
their average test scores and locations. 

Stratified sampling, however, reduces the number of degrees of freedom for statistical tests 
if stratum effects are included in the regression models (see equation (2) above), which could
offset some of the precision gains from stratification. This precision loss, however, is meaningful 
only if small numbers of groups are randomly assigned to a research condition. 

An extreme form of stratification occurs when, prior to random assignment, only two units 
are assigned to each stratum. This pairwise matching approach is sometimes used when only small
numbers of units are randomly assigned to a research condition to avoid the possibility of 
obtaining a “bad draw.” For example, the SACD evaluation used this pairwise matching design 
to allocate the 10 to 18 schools within each site to the treatment and control groups. Schools with 
similar characteristics were first paired, and one school in a pair was then randomly assigned to 
the treatment group and the other in the pair was randomly assigned to the control group. As 
discussed, this sampling approach is also used, by necessity, in designs where only two
classrooms within a school are available for random assignment. 

Under designs with only one treatment group and one control group per stratum, there are no degrees
of freedom available for estimating within-stratum group effects (Murray 1998). As discussed
above for the case with only one classroom per condition per school, there are several 
approaches for dealing with this problem. One approach is to ignore the stratification in the 
analysis (which could increase the ICC estimates), while another approach is to use a random 
effects framework where stratum effects are treated as random (with the associated loss in 
degrees of freedom). For the second approach, the leading term in the variance formula for an 
impact estimate represents the extent to which impacts vary across strata (pairs). Diehr et al.
(1995), Martin et al. (1993), and Klar and Donner (1997) discuss the benefits of the various
approaches when the number of pairs is small (and hence, where statistical power losses from 
pairwise matching could be severe). 



3. Using Regression Models 

For a given sample design, the most effective strategy for improving precision levels for 
group-based random assignment designs is to use regression models to estimate program 
impacts. The inclusion of relevant baseline student-, classroom-, and school-level explanatory 
variables in the regression models can increase power by explaining some of the variance in 
mean outcomes across schools and across classrooms within schools (that is, by increasing 
regression $R^2$ values).

To demonstrate the power improvements from using regression (ANCOVA) models, we 
consider the design where the school is the unit of random assignment, and generalize equation 
(17) as follows: 



(18)  $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2 \rho_1 (1 - R^2_{BS})}{.5s} + \dfrac{2\sigma^2 \rho_2 (1 - R^2_{BC})}{(.5s)k^*} + \dfrac{2\sigma^2 (1 - \rho_1 - \rho_2)(1 - R^2_{W})}{(.5s)k^* n}$ ¹³

In this expression, $R^2_{BS}$ is the proportion of the between-school variance that is explained by the
regression model, $R^2_{BC}$ is the proportion of the between-classroom variance within schools that is
explained by the regression model, and $R^2_{W}$ is the proportion of the within-classroom variance
that is explained by the regression model. Thus, the inclusion of explanatory variables that have 
significant predictive power in the regression models can substantially improve the precision 
levels of the impact estimates. The most effective explanatory variables are likely to be pre- 
intervention measures of the outcome variables, measured at the student, classroom, and 
aggregate school levels. 
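
The precision gains from regression adjustment can be illustrated numerically. The Python sketch below assumes that each variance term of the school-based design is scaled by one minus the corresponding R² value (between schools, between classrooms within schools, and within classrooms). The function and argument names and the input values are hypothetical, and the small correction factor for group-level covariates is ignored.

```python
def var_with_covariates(s, k_star, n, rho1, rho2,
                        r2_bs=0.0, r2_bc=0.0, r2_w=0.0, sigma2=1.0):
    """Regression-adjusted variance of the impact estimate.

    r2_bs, r2_bc, r2_w: assumed shares of the between-school,
    between-classroom, and within-classroom variance explained by covariates.
    """
    schools_per_group = 0.5 * s
    school = 2 * sigma2 * rho1 * (1 - r2_bs) / schools_per_group
    classroom = 2 * sigma2 * rho2 * (1 - r2_bc) / (schools_per_group * k_star)
    student = (2 * sigma2 * (1 - rho1 - rho2) * (1 - r2_w)
               / (schools_per_group * k_star * n))
    return school + classroom + student


# Illustrative values only: the same R-squared is applied at every level.
base = var_with_covariates(20, 3, 20, 0.15, 0.15)
for r2 in (0.0, 0.20, 0.50):
    v = var_with_covariates(20, 3, 20, 0.15, 0.15, r2, r2, r2)
    print(f"R2 = {r2:.2f}: variance = {v:.4f}, SE ratio = {(v / base) ** 0.5:.3f}")
```

When the same R² applies at every level, the variance falls by that proportion, so the standard error falls by a factor of the square root of 1 − R². Because the school-level term is usually the largest, covariates that explain between-school variance matter most.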



13 As shown in Raudenbush (1997), a small correction factor needs to be applied to the variance formulas when 
group-level covariates are included in the regression model. 

It is important to note that it is possible, although unlikely, that $R^2_{BS}$ or $R^2_{BC}$ (but not $R^2_{W}$) are
negative if the distribution of the covariates across groups exacerbates differences across the 
groups (Murray 1998). Thus, regression adjustment methods do not necessarily reduce ICCs. 

The groups of covariates that can be included in the regression models will depend on the 
design. For instance, for the fixed-effects design where students or classrooms are randomly 
assigned to a research condition within volunteer schools and districts, the covariates cannot 
include school-level (or district-level) measures. This is because these measures will be perfectly 
collinear with the school indicator variables (see equation (7)). However, school-level covariates 
should be included if school effects are treated as random. 

The inclusion of covariates decreases the degrees of freedom available for statistical tests, 
but in ways that depend on the level at which the covariates are measured. For estimating 
variances at the group level, one degree of freedom is lost for each group-level covariate 
included in the model. Individual-level covariates, however, reduce the degrees of freedom for 
estimating the individual-level variance terms, but not the group-level terms. Thus, if available, 
individual-level covariates are preferred to group-level ones. Furthermore, because the degrees 
of freedom at the group level are critical for power, use of group-level covariates should be 
limited to those that have significant explanatory power in the regression models, and that adjust 
for residual measurable differences between the treatment and control groups. 

To obtain benchmark regression $R^2$ values, we examined the fit of models using baseline and
follow-up test score data on elementary school students from various data sources: (1) the 
LESCP, (2) the national evaluation of the 21st Century Community Learning Centers program, 
and (3) the TFA evaluation. Our analysis indicated that $R^2_{BS}$ and $R^2_{W}$ values were at least .50 in
regression models that included student-level baseline test scores as explanatory variables. 
Gargani and Cook (2005) and Bloom et al. (1999) found similar values using test score data from 
Louisville, KY and Rochester, NY, respectively. However, in the absence of these
pre-intervention measures of the outcome variables, $R^2$ values were closer to .20. Because the
amount and quality of baseline data vary across evaluations, we conduct our power calculations
assuming conservative $R^2$ values of 0, .20 and .50.



4. Including Finite Population Corrections 

When samples of students, classrooms, and schools are considered to be sampled from a 
finite population, the use of a finite population correction (fpc) reduces the variance of a sample 
mean by a factor equal to 1 minus the proportion of the population being selected. The gains 
from using the fpc can be substantial if a significant proportion of all population units are 
selected to the sample (because, under repeated sampling, there would be considerable overlap in 
the analysis samples). This is relevant for many group-based evaluations of education programs, 
because it is often the case that a large percentage of all relevant units are randomly assigned to a 
research status. For instance, for the SACD evaluation, half the population of schools per site
were randomly assigned to the treatment group and half were randomly assigned to the control 
group. Similarly, for the Evaluation of the Effectiveness of Educational Technology 
Interventions (Dynarski et al. 2004), most teachers in participating schools and grades will be 
assigned at random to use a technology intervention or not. Thus, it is worth considering whether 
the use of a finite population correction increases precision levels for impact estimates under
group-based experimental designs. 

In order to address this issue, we first note that for the multi-stage designs that we have 
considered, randomization at each stage takes one of two forms: (1) the random assignment of 
units to a research condition, or (2) the random selection of units to the sample from a larger 
universe of units. For example, under some school-based experimental designs, schools are 
randomly assigned to a research condition (the first type of randomization) and classrooms are 
then randomly selected within the study schools (the second type of randomization). The 
variance formulas for such multi-stage designs include terms that account for both sources of 
randomization. 

The fpc does not apply to variance terms associated with the random assignment of units 
(the first type of randomization) when a large percentage of all units are randomly assigned. This 
is because there is a negative correlation between the treatment and control group means that 
cancels the gains from using the finite population correction. To fix ideas, consider a design
where 100 students within purposively selected schools are randomly assigned to a treatment or
control group. Then, if, by chance, average test scores for the 50 treatment group students are 
larger than average test scores of all 100 students, then, by definition, the 50 control group 
students will have lower-than-average test scores. Because the variance of an impact estimate 
equals the sum of the variances of the treatment and control group means minus twice the 
covariance between the two means, a negative correlation between the means increases the 
variance of the impact estimates, and directly offsets the precision gains from using the fpc. 
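
This cancellation can be checked exactly on a small hypothetical example. The Python sketch below enumerates every equal split of eight fixed test scores into treatment and control groups, assuming no treatment effect, and compares the exact variance of the difference in means with the usual formula with and without the fpc. The scores and function name are illustrative assumptions.

```python
from itertools import combinations
from statistics import pvariance, variance


def randomization_variance(scores):
    """Exact variance of (treatment mean - control mean) over all equal
    splits of a fixed, even-sized set of scores, with no treatment effect."""
    size = len(scores)
    half = size // 2
    total = sum(scores)
    diffs = []
    for treated in combinations(range(size), half):
        t_sum = sum(scores[i] for i in treated)
        diffs.append(t_sum / half - (total - t_sum) / (size - half))
    return pvariance(diffs)


scores = [52.0, 61.0, 47.0, 70.0, 58.0, 49.0, 66.0, 55.0]  # hypothetical data
n = len(scores) // 2
s2 = variance(scores)                       # variance with N - 1 divisor

exact = randomization_variance(scores)
no_fpc = 2 * s2 / n                         # usual formula, no correction
with_fpc = 2 * (s2 / n) * (1 - n / len(scores))

print(f"exact = {exact:.3f}, no fpc = {no_fpc:.3f}, with fpc = {with_fpc:.3f}")
```

In this setting the exact randomization variance coincides with the uncorrected formula, and applying the fpc to each group mean without accounting for the negative covariance would understate the variance by half.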

The fpc, however, does apply for variance terms associated with the random selection of 
units (the second type of randomization). For example, consider a three-stage sample design 
where schools are randomly assigned to a treatment or control group in the first stage, 
classrooms are randomly sampled within schools in the second stage, and students are randomly 
sampled within classrooms in the third stage. Then, the finite population correction applies to 
the classroom- and student-level variance terms, but not to the school-level term. Using earlier 
results, the variance formula can be written as follows: 



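(19)  $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2 \rho_1}{.5s} + \dfrac{2\sigma^2 \rho_2}{(.5s)k^*}\left(1 - \dfrac{k^*}{K}\right) + \dfrac{2\sigma^2 (1 - \rho_1 - \rho_2)}{(.5s)k^* n}\left(1 - \dfrac{n}{N}\right)$
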
where $K$ is the total number of classrooms per school and $N$ is the total number of students per
classroom (and where we have omitted the regression $R^2$ terms). Thus, the classroom-level
effect is reduced as the sampling fraction of classrooms is increased, and similarly for the
student-level effect. However, the finite population correction does not affect the school-level
term. Clearly, if the population universe is assumed to be infinite, the finite population corrections
do not enter the variance formulas.
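
The role of the sampling fractions can be sketched numerically. The Python function below assumes that the correction enters as a factor (1 − k*/K) on the classroom-level term and (1 − n/N) on the student-level term, with the school-level term left unchanged; this form is an assumption based on the verbal description, and the names and values are hypothetical.

```python
def var_with_fpc(s, k_star, n, K, N, rho1, rho2, sigma2=1.0):
    """Variance of the impact estimate with finite population corrections.

    k_star of K classrooms per school and n of N students per classroom are
    sampled; the school-level term carries no correction (assumed form).
    """
    if k_star > K or n > N:
        raise ValueError("sample sizes cannot exceed population sizes")
    schools_per_group = 0.5 * s
    school = 2 * sigma2 * rho1 / schools_per_group
    classroom = (2 * sigma2 * rho2 / (schools_per_group * k_star)
                 * (1 - k_star / K))
    student = (2 * sigma2 * (1 - rho1 - rho2)
               / (schools_per_group * k_star * n) * (1 - n / N))
    return school + classroom + student


# Illustrative values only: 3 sampled classrooms out of K, 20 students out of N.
for K, N in ((3, 20), (4, 25), (6, 30), (10**6, 10**6)):
    v = var_with_fpc(20, 3, 20, K, N, 0.15, 0.15)
    print(f"K = {K}, N = {N}: variance = {v:.4f}")
```

Under this assumed form, sampling every classroom and student leaves only the school-level term, while very large K and N reproduce the variance without any correction.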



5. Accounting for Longitudinal Observations and Repeated Measures

For many evaluations of education programs, longitudinal data are collected on sample 
members at baseline and at various follow-up time points to examine changes in impacts over 
time. In this section, we discuss appropriate variance formulas for impact estimates using 
longitudinal observations where time is modeled either as a fixed effect or a linear time trend in 
the HLM framework. We examine also the case where repeated measures are collected on units 
within a time period. We use results found in Koepsell et al. (1991), Murray et al. (1998), 
Murray and Blitstein (2003), Klar and Darlington (2004), and Janega et al. (2004). For 
illustrative simplicity, we focus on the design where schools are the unit of random assignment 
and where classroom-level clustering is not present, although the results can be easily applied to 
other designs that we have considered. 



a. Modeling Time as a Fixed Effect 

Suppose that comparable test score data are available at baseline and at several follow-up 
points. As discussed, one procedure for incorporating the baseline data into the posttest analysis 
is to include the baseline test scores as covariates in the regression models. Another procedure is 
to treat the baseline test scores as a dependent variable along with the follow-up test scores and 
to include time effects in the regression models. 

Consider first a design where, within the study schools, data are collected on different 
cohorts of students during the baseline and follow-up periods (which would occur, for instance, 
if data were collected on only third grade students in each period). Consider also the following 
two-level HLM model, where Level 1 pertains to the student and Level 2 pertains to the school 
(the unit of random assignment): 



(20)  Level 1:  $Y_{ipq} = \beta_p + \sum_{q=2}^{l} \lambda_{pq} F_{ipq} + e_{ipq}$

      Level 2:  $\beta_p = \gamma_0 + \gamma_1 T_p + u_p$

                $\lambda_{pq} = \lambda_q + \delta_{1q} T_p + \tau_{pq}$,



where $Y_{ipq}$ are standardized test scores of student $i$ in school $p$ at follow-up point $q$ ($q = 1, \ldots, l$),
where period $q = 1$ corresponds to the baseline period; $F_{ipq}$ is an indicator variable equal to 1 for
observations at follow-up point $q$; $T_p$ is a treatment status indicator variable for school $p$; $u_p$ are
iid $N(0, \sigma^2_u)$ school-specific random error terms (at baseline); $\tau_{pq}$ are iid $N(0, \sigma^2_\tau)$ error terms
which represent the extent to which school effects vary over time during the follow-up period
(relative to the baseline period); $e_{ipq}$ are iid $N(0, \sigma^2_e)$ student-level residual error terms that are
distributed independently of $u_p$ and $\tau_{pq}$; and the remaining terms are parameters.

Inserting the Level 2 equations into the Level 1 equation yields:



(21)  $Y_{ipq} = \gamma_0 + \gamma_1 T_p + \sum_{q=2}^{l} \lambda_q F_{ipq} + \sum_{q=2}^{l} \delta_{1q} (T_p * F_{ipq}) + \left[ u_p + \sum_{q=2}^{l} \tau_{pq} F_{ipq} + e_{ipq} \right]$

In this formulation, $\delta_{1q}$ represents the impact in follow-up period $q$, and is the treatment-control
difference between the mean posttest score in period $q$ relative to the mean pretest score in
period 1 (that is, $[\bar{Y}_{\cdot\cdot qT} - \bar{Y}_{\cdot\cdot 1T}] - [\bar{Y}_{\cdot\cdot qC} - \bar{Y}_{\cdot\cdot 1C}]$). Because the $u_p$ terms cancel in this
difference-in-difference estimator, the variance of the impact estimate is:

Using earlier results, this variance formula can also be expressed as follows:



(23) Var {impact) = 2[ ^ <7 " ^ — — H — — ^ — ], 

where $c_4$ represents the correlation between mean test scores within a school over time.

It is an empirical issue whether conducting a pretest-posttest analysis yields more efficient
estimates than conducting a posttest analysis only with the pretest scores included as covariates
in the regression models. This issue largely depends on the extent to which student outcomes
within a school vary over time (that is, on $c_4$) and the predictive power of the pretest scores in the
posttest regression models. Janega et al. (2004) provide evidence using data from the TEENS 
study that the regression-adjusted posttest analysis is the more powerful technique. 

Finally, we note that equations (22) and (23) are applicable also to the case where data on 
the same students are collected over time in the study schools, and where there is no repeated 
testing of students within the same time period (Murray 1998). This is because, although time- 
by-student random effects can be included in the models, time-by-student and within-student 
variability are not separable. 
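
To make the role of the variance components concrete, the following Python sketch computes the variance of a difference-in-difference impact estimate under the error structure described above. It is an illustration under stated assumptions, not a formula taken from this paper: it assumes a design with `s_t` treatment schools and `s_c` control schools, `n` students sampled independently per school in each period, and a time-varying school term that is present in every period, including the baseline. The function and argument names (`did_impact_variance`, `sigma2_tau`, `sigma2_e`) are hypothetical.

```python
def did_impact_variance(sigma2_tau, sigma2_e, s_t, s_c, n):
    """Variance of a school-level difference-in-difference impact estimate.

    Assumptions (ours, for illustration only):
      * sigma2_tau: variance of the time-varying school term, drawn
        independently in each period, including the baseline
      * sigma2_e: student-level residual variance
      * s_t, s_c: number of treatment and control schools
      * n: students per school per period, sampled independently
    """
    if min(s_t, s_c, n) <= 0:
        raise ValueError("school and student counts must be positive")
    # Variance of one school's change in mean score between two periods.
    # The permanent school effect is identical in both periods, so it
    # drops out of the change and does not appear here.
    per_school_change = 2.0 * (sigma2_tau + sigma2_e / n)
    # Average the changes over schools within each arm, then take the
    # treatment-control difference of the two independent averages.
    return per_school_change * (1.0 / s_t + 1.0 / s_c)


if __name__ == "__main__":
    var = did_impact_variance(sigma2_tau=0.02, sigma2_e=0.80,
                              s_t=20, s_c=20, n=50)
    print(f"variance = {var:.4f}, standard error = {var ** 0.5:.4f}")
```

Because the permanent school effect does not enter the calculation, the sketch shows why a pretest-posttest design is most helpful when school effects are stable over time, that is, when `sigma2_tau` is small relative to the total between-school variance.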



b. Linear Trend Analysis 

In education research, growth-curve analyses are often conducted to examine intervention 
effects on the growth trajectories of student outcomes. In these analyses, longitudinal 
observations are modeled as a function of time (measured, for example, as the number of months 
or years from random assignment until data collection). In its simplest form, time can be 
modeled as a linear trend, in which case equation (21) can be modified as follows: 



(24) $Y_{ipq} = \gamma_0 + \gamma_1 T_p + \lambda_1 t_q + \delta_1 (T_p * t_q) + [u_p + \tau_{pq} + e_{ipq}]$, 
where $t_q$ is the time between random assignment and the collection of observation $q$ 
(appropriately centered). 

In this model, the intervention effect is $\delta_1$, which represents the treatment-control difference 
in the estimated slopes from regressions of test scores on time. Standard regression theory shows 
that this impact estimate can be expressed as a weighted sum of the $(Q-1)$ difference-in-difference 
estimators discussed above with weights $(t_q - \bar{t}) / \sum_q (t_q - \bar{t})^2$. The variance of this impact estimate is: 
where L is the length of the follow-up period (see Koepsell et al. 1991 for a similar expression). 
This variance expression tends to decrease as the number of time periods increases. Murray 
(1998) provides more general versions of variance formulas for growth curve models that allow 
for covariates and random time trends. 
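
The weighted-sum representation described above can be checked numerically. The following Python sketch verifies, for made-up arm-level mean scores, that the treatment-control difference in least-squares slopes equals the weighted sum of difference-in-difference estimates taken relative to the baseline period. It also computes the variance of the slope difference under assumptions that are ours rather than the paper's: the time-varying school term and the student residuals are independent across periods, and each arm has the same `n` students per school in every period. All function names, argument names, and data values are hypothetical.

```python
def _centered_sum_of_squares(times):
    """Return the mean of times and the sum of squared deviations from it."""
    t_bar = sum(times) / len(times)
    return t_bar, sum((t - t_bar) ** 2 for t in times)


def ols_slope(times, values):
    """Slope from a least-squares regression of values on times."""
    t_bar, sxx = _centered_sum_of_squares(times)
    y_bar = sum(values) / len(values)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx


def weighted_did_slope(times, treat_means, control_means):
    """Slope difference as a weighted sum of difference-in-differences."""
    t_bar, sxx = _centered_sum_of_squares(times)
    total = 0.0
    # Index 0 is the baseline; each later period contributes one
    # difference-in-difference estimate relative to that baseline.
    for q in range(1, len(times)):
        did_q = ((treat_means[q] - treat_means[0])
                 - (control_means[q] - control_means[0]))
        total += (times[q] - t_bar) / sxx * did_q
    return total


def trend_impact_variance(times, sigma2_tau, sigma2_e, s_t, s_c, n):
    """Variance of the slope difference under the stated assumptions.

    The regression weights sum to zero, so the permanent school effect
    cancels, as it does for the difference-in-difference estimator.
    """
    _, sxx = _centered_sum_of_squares(times)
    return (sigma2_tau + sigma2_e / n) * (1.0 / s_t + 1.0 / s_c) / sxx


if __name__ == "__main__":
    months = [0, 6, 12, 18]                 # illustrative collection times
    treat = [50.0, 52.5, 54.0, 57.0]        # made-up treatment means
    control = [50.5, 51.5, 52.0, 53.5]      # made-up control means
    direct = ols_slope(months, treat) - ols_slope(months, control)
    print(direct, weighted_did_slope(months, treat, control))
    # Same 12-month follow-up, observed at 3 versus 5 time points.
    for schedule in ([0, 6, 12], [0, 3, 6, 9, 12]):
        print(len(schedule),
              trend_impact_variance(schedule, 0.02, 0.80, 20, 20, 50))
```

In the second loop the follow-up length is held fixed while the number of collection points grows, which is one way to see the variance reduction from additional time periods noted above.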



c. Accounting for Repeated Measures 

In some evaluations, repeated measures are collected on subjects at each data collection 
point. For example, in the Evaluation of the Impact of Teacher Induction Programs, researchers 
plan to observe teacher practices twice per data collection point. The presence of repeated 
measures increases the effective sample size for the analysis, and hence, increases the precision 
of the impact estimates. The effective sample size will depend on the correlation of the repeated 
measures. 
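
As a rough illustration of how this correlation matters, the following Python sketch applies the standard design-effect calculation for equally correlated measurements: the mean of `m` measures with common pairwise correlation `rho` carries as much information as `m / (1 + (m - 1) * rho)` independent measures. This is a textbook result used here for illustration, not a formula taken from this paper, and the names `effective_measures`, `m`, and `rho` are hypothetical.

```python
def effective_measures(m, rho):
    """Effective number of independent measurements per subject.

    Assumes m measurements on the same subject with equal variances and a
    common pairwise correlation rho between any two of them.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # The variance of the subject mean is sigma^2 * (1 + (m - 1) * rho) / m,
    # so the measurements count as m / (1 + (m - 1) * rho) independent ones.
    return m / (1.0 + (m - 1) * rho)


if __name__ == "__main__":
    # Two observations per data collection point, at several correlations.
    for rho in (0.0, 0.3, 0.6, 0.9):
        print(f"rho = {rho:.1f}: effective measures = "
              f"{effective_measures(2, rho):.2f}")
```

The gain applies only to the measurement-level component of variance. Variation between schools and between subjects is unaffected by taking additional measurements on the same subjects, so the improvement in the precision of the impact estimate is smaller than this ratio alone suggests.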

To quantify the extent to which repeated measures improve precision levels, we consider a 
three-stage HLM model, where Level 1 refers to measurement $m$, Level 2 refers to subjects, and 
Level 3 to schools (the unit of random assignment). In this case, the treatment effect for a single 
posttest period can be estimated using the following expression: 



(26) $Y_{ipm} = \gamma_0 + \gamma_1 T_p + [u_p + \tau_{ip} + e_{ipm}]$. 